OpenDAS / nni
Commit c5acd8c2 (Unverified)
Authored May 27, 2019 by SparkSnail; committed by GitHub, May 27, 2019
Merge pull request #173 from microsoft/master
merge master
Parents: 40bae6e2, d135d184
Changes: 93
Showing 20 changed files with 200 additions and 89 deletions
docs/zh_CN/assessors.rst (+2 -2)
docs/zh_CN/automl_practice_sharing.rst (+8 -0)
docs/zh_CN/builtinTuner.rst (+0 -18)
docs/zh_CN/builtin_assessor.rst (+9 -0)
docs/zh_CN/builtin_tuner.rst (+18 -0)
docs/zh_CN/community_sharings.rst (+12 -0)
docs/zh_CN/contribution.rst (+2 -2)
docs/zh_CN/examples.rst (+12 -0)
docs/zh_CN/index.rst (+6 -6)
docs/zh_CN/nni_practice_sharing.rst (+10 -0)
docs/zh_CN/reference.rst (+1 -1)
docs/zh_CN/training_services.rst (+1 -1)
docs/zh_CN/tuners.rst (+3 -3)
docs/zh_CN/tutorials.rst (+16 -0)
install.ps1 (+1 -1)
src/nni_manager/common/utils.ts (+8 -8)
src/nni_manager/core/nnimanager.ts (+9 -6)
src/nni_manager/training_service/common/util.ts (+48 -7)
src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts (+29 -30)
src/nni_manager/training_service/remote_machine/sshClientUtility.ts (+5 -4)
docs/zh_CN/assessors.rst
@@ -15,5 +15,5 @@ Assessor 从 Trial 中接收中间结果,并通过指定的算法决定此 Tri
 .. toctree::
     :maxdepth: 2
 
-    内置 Assessor<builtinAssessor>
-    自定义 Assessor<Customize_Assessor>
+    内置 Assessor<BuiltinAssessor>
+    自定义 Assessor<CustomizeAssessor>
docs/zh_CN/automl_practice_sharing.rst (new file, 0 → 100644)
+#################
+自动机器学习的经验分享
+#################
+
+.. toctree::
+    :maxdepth: 2
+
+    神经网络架构搜索的对比<CommunitySharings/AutomlPracticeSharing/NasComparison>
docs/zh_CN/builtinTuner.rst (deleted, 100644 → 0, viewed at 40bae6e2)
-内置 Tuner
-==================
-
-.. toctree::
-    :maxdepth: 1
-
-    介绍<Builtin_Tuner>
-    TPE<hyperoptTuner>
-    Random Search<hyperoptTuner>
-    Anneal<hyperoptTuner>
-    Naive Evolution<evolutionTuner>
-    SMAC<smacTuner>
-    Batch Tuner<batchTuner>
-    Grid Search<gridsearchTuner>
-    Hyperband<hyperbandAdvisor>
-    Network Morphism<networkmorphismTuner>
-    Metis Tuner<metisTuner>
-    BOHB<bohbAdvisor>
\ No newline at end of file
docs/zh_CN/builtinAssessor.rst → docs/zh_CN/builtin_assessor.rst
@@ -4,6 +4,6 @@
 .. toctree::
     :maxdepth: 1
 
-    介绍<Builtin_Assessors>
-    Medianstop<medianstopAssessor>
-    Curvefitting<curvefittingAssessor>
\ No newline at end of file
+    介绍<BuiltinAssessors>
+    Medianstop<MedianstopAssessor>
+    Curvefitting<CurvefittingAssessor>
\ No newline at end of file
docs/zh_CN/builtin_tuner.rst (new file, 0 → 100644)
+内置 Tuner
+==================
+
+.. toctree::
+    :maxdepth: 1
+
+    介绍<BuiltinTuner>
+    TPE<HyperoptTuner>
+    Random Search<HyperoptTuner>
+    Anneal<HyperoptTuner>
+    Naive Evolution<EvolutionTuner>
+    SMAC<SmacTuner>
+    Batch Tuner<BatchTuner>
+    Grid Search<GridsearchTuner>
+    Hyperband<HyperbandAdvisor>
+    Network Morphism<NetworkmorphismTuner>
+    Metis Tuner<MetisTuner>
+    BOHB<BohbAdvisor>
\ No newline at end of file
docs/zh_CN/community_sharings.rst (new file, 0 → 100644)
+######################
+社区分享
+######################
+
+除了官方的教程和示例之外,也支持社区贡献者分享自己的自动机器学习实践经验,特别是使用 NNI 的实践经验。
+
+.. toctree::
+    :maxdepth: 2
+
+    NNI 经验分享<nni_practice_sharing>
+    神经网络结构搜索的对比<CommunitySharings/NasComparison>
+    超参调优算法的对比<CommunitySharings/HpoComparison>
docs/zh_CN/Contribution.rst → docs/zh_CN/contribution.rst
@@ -3,5 +3,5 @@
 ###############################
 
 .. toctree::
-    设置开发环境<SetupNNIDeveloperEnvironment>
-    贡献指南<CONTRIBUTING>
\ No newline at end of file
+    设置开发环境<SetupNniDeveloperEnvironment>
+    贡献指南<Contributing>
\ No newline at end of file
docs/zh_CN/Examples.rst → docs/zh_CN/examples.rst
@@ -5,8 +5,8 @@
 .. toctree::
     :maxdepth: 2
 
-    MNIST<mnist_examples>
-    Cifar10<cifar10_examples>
-    Scikit-learn<sklearn_examples>
-    EvolutionSQuAD<SQuAD_evolution_examples>
-    GBDT<gbdt_example>
+    MNIST<MnistExamples>
+    Cifar10<Cifar10Examples>
+    Scikit-learn<SklearnExamples>
+    EvolutionSQuAD<SquadEvolutionExamples>
+    GBDT<GbdtExample>
docs/zh_CN/index.rst
@@ -13,10 +13,10 @@ Neural Network Intelligence(NNI)文档
     概述<Overview>
     入门<QuickStart>
-    教程<Tutorials>
-    样例<Examples>
-    参考<Reference>
+    教程<tutorials>
+    示例<examples>
+    参考<reference>
     常见问答<FAQ>
-    贡献<Contribution>
-    版本日志<RELEASE>
-    博客<Blog/index>
+    贡献<contribution>
+    更改日志<Release>
+    社区经验分享<community_sharings>
docs/zh_CN/nni_practice_sharing.rst (new file, 0 → 100644)
+#################
+教程
+#################
+
+分享使用 NNI 来调优模型和系统的经验
+
+.. toctree::
+    :maxdepth: 2
+
+    在 NNI 上调优 Recommenders 的 SVD<CommunitySharings/NniPracticeSharing/RecommendersSvd>
\ No newline at end of file
docs/zh_CN/Reference.rst → docs/zh_CN/reference.rst
@@ -4,7 +4,7 @@
 .. toctree::
     :maxdepth: 3
 
-    命令行<NNICTLDOC>
+    命令行<Nnictl>
     Python API<sdk_reference>
     Annotation<AnnotationSpec>
     配置<ExperimentConfig>
...
...
docs/zh_CN/training_services.rst
@@ -4,6 +4,6 @@ NNI 支持的训练平台介绍
 .. toctree::
     本机<LocalMode>
     远程<RemoteMachineMode>
-    OpenPAI<PAIMode>
+    OpenPAI<PaiMode>
     Kubeflow<KubeflowMode>
     FrameworkController<FrameworkControllerMode>
\ No newline at end of file
docs/zh_CN/tuners.rst
@@ -13,6 +13,6 @@ Tuner 从 Trial 接收指标结果,来评估一组超参或网络结构的性
 .. toctree::
     :maxdepth: 2
 
-    内置 Tuner<builtinTuner>
-    自定义 Tuner<Customize_Tuner>
-    自定义 Advisor<Customize_Advisor>
\ No newline at end of file
+    内置 Tuner<BuiltinTuner>
+    自定义 Tuner<CustomizeTuner>
+    自定义 Advisor<CustomizeAdvisor>
\ No newline at end of file
docs/zh_CN/tutorials.rst (new file, 0 → 100644)
+######################
+教程
+######################
+
+.. toctree::
+    :maxdepth: 2
+
+    安装<Installation>
+    实现 Trial<Trials>
+    Tuner<tuners>
+    Assessor<assessors>
+    Web 界面<WebUI>
+    训练平台<training_services>
+    如何使用 Docker <HowToUseDocker>
+    高级功能<advanced>
+    如何调试<HowToDebug>
\ No newline at end of file
install.ps1
@@ -15,7 +15,7 @@ $yarnUrl = "https://yarnpkg.com/latest.tar.gz"
 $unzipNodeDir = "node-v*"
 $unzipYarnDir = "yarn-v*"
 
-$NNI_DEPENDENCY_FOLDER = "C:\tmp\$env:USERNAME"
+$NNI_DEPENDENCY_FOLDER = [System.IO.Path]::GetTempPath()+$env:USERNAME
 
 $WHICH_PYTHON = where.exe python
 if($WHICH_PYTHON -eq $null){
...
...
src/nni_manager/common/utils.ts
@@ -43,11 +43,11 @@ function getExperimentRootDir(): string {
         .getLogDir();
 }
 
-function getLogDir(): string{
+function getLogDir(): string {
     return path.join(getExperimentRootDir(), 'log');
 }
 
-function getLogLevel(): string{
+function getLogLevel(): string {
     return getExperimentStartupInfo()
         .getLogLevel();
 }
@@ -149,7 +149,7 @@ function parseArg(names: string[]): string {
     return '';
 }
 
-function encodeCmdLineArgs(args:any):any{
+function encodeCmdLineArgs(args: any): any {
     if(process.platform === 'win32'){
         return JSON.stringify(args);
     }
@@ -158,7 +158,7 @@ function encodeCmdLineArgs(args:any):any{
     }
 }
 
-function getCmdPy(): string{
+function getCmdPy(): string {
     let cmd = 'python3';
     if(process.platform === 'win32'){
         cmd = 'python';
@@ -390,7 +390,7 @@ async function getVersion(): Promise<string> {
 /**
  * run command as ChildProcess
  */
-function getTunerProc(command: string, stdio: StdioOptions, newCwd: string, newEnv: any): ChildProcess{
+function getTunerProc(command: string, stdio: StdioOptions, newCwd: string, newEnv: any): ChildProcess {
     let cmd: string = command;
     let arg: string[] = [];
     let newShell: boolean = true;
@@ -411,7 +411,7 @@ function getTunerProc(command: string, stdio: StdioOptions, newCwd: string, newE
 /**
  * judge whether the process is alive
  */
-async function isAlive(pid:any): Promise<boolean>{
+async function isAlive(pid: any): Promise<boolean> {
     let deferred: Deferred<boolean> = new Deferred<boolean>();
     let alive: boolean = false;
     if(process.platform === 'win32'){
@@ -439,7 +439,7 @@ async function isAlive(pid:any): Promise<boolean>{
 /**
  * kill process
  */
-async function killPid(pid:any): Promise<void>{
+async function killPid(pid: any): Promise<void> {
     let deferred: Deferred<void> = new Deferred<void>();
     try {
         if (process.platform === "win32") {
@@ -455,7 +455,7 @@ async function killPid(pid:any): Promise<void>{
     return deferred.promise;
 }
 
-function getNewLine(): string{
+function getNewLine(): string {
     if (process.platform === "win32") {
         return "\r\n";
     }
...
...
src/nni_manager/core/nnimanager.ts
@@ -58,7 +58,8 @@ class NNIManager implements Manager {
     private status: NNIManagerStatus;
     private waitingTrials: string[];
     private trialJobs: Map<string, TrialJobDetail>;
+    private trialJobMetricListener: (metric: TrialJobMetric) => void;
 
     constructor() {
         this.currSubmittedTrialNum = 0;
         this.trialConcurrencyChange = 0;
@@ -76,6 +77,11 @@ class NNIManager implements Manager {
             status: 'INITIALIZED',
             errors: []
         };
+        this.trialJobMetricListener = (metric: TrialJobMetric) => {
+            this.onTrialJobMetrics(metric).catch((err: Error) => {
+                this.criticalError(NNIError.FromError(err, 'Job metrics error: '));
+            });
+        };
     }
 
     public updateExperimentProfile(experimentProfile: ExperimentProfile, updateType: ProfileUpdateType): Promise<void> {
@@ -342,6 +348,7 @@ class NNIManager implements Manager {
         if (this.dispatcher === undefined) {
             throw new Error('Error: tuner has not been setup');
         }
+        this.trainingService.removeTrialJobMetricListener(this.trialJobMetricListener);
         this.dispatcher.sendCommand(TERMINATE);
         let tunerAlive: boolean = true;
         // gracefully terminate tuner and assessor here, wait at most 30 seconds.
@@ -589,11 +596,7 @@ class NNIManager implements Manager {
         if (this.dispatcher === undefined) {
             throw new Error('Error: tuner or job maintainer have not been setup');
         }
-        this.trainingService.addTrialJobMetricListener((metric: TrialJobMetric) => {
-            this.onTrialJobMetrics(metric).catch((err: Error) => {
-                this.criticalError(NNIError.FromError(err, 'Job metrics error: '));
-            });
-        });
+        this.trainingService.addTrialJobMetricListener(this.trialJobMetricListener);
 
         this.dispatcher.onCommand((commandType: string, content: string) => {
             this.onTunerCommand(commandType, content).catch((err: Error) => {
...
...
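The nnimanager.ts hunks replace an inline arrow function with a stored `trialJobMetricListener`, so that the same function reference can later be handed to `removeTrialJobMetricListener`. A minimal sketch of why the stored reference matters, assuming a training service backed by Node's `EventEmitter`; `FakeTrainingService`, `SketchManager`, and `Metric` are hypothetical stand-ins, not NNI classes:

```typescript
import { EventEmitter } from 'events';

// Hypothetical stand-in for TrialJobMetric.
interface Metric { id: string; data: string; }

// Hypothetical training service: listeners are matched by reference on removal.
class FakeTrainingService {
    private readonly emitter: EventEmitter = new EventEmitter();

    public addTrialJobMetricListener(listener: (m: Metric) => void): void {
        this.emitter.on('metric', listener);
    }

    public removeTrialJobMetricListener(listener: (m: Metric) => void): void {
        this.emitter.removeListener('metric', listener);
    }

    public emit(m: Metric): void {
        this.emitter.emit('metric', m);
    }

    public listenerCount(): number {
        return this.emitter.listenerCount('metric');
    }
}

// Hypothetical manager: the listener is created once and kept as a field,
// so start() and stop() pass the identical function object.
class SketchManager {
    public readonly received: Metric[] = [];
    private readonly listener: (m: Metric) => void;

    constructor(private readonly service: FakeTrainingService) {
        this.listener = (m: Metric): void => { this.received.push(m); };
    }

    public start(): void {
        this.service.addTrialJobMetricListener(this.listener);
    }

    public stop(): void {
        this.service.removeTrialJobMetricListener(this.listener);
    }
}
```

Had `start()` registered a fresh inline arrow function instead, `stop()` would have no reference to pass and the listener would stay attached.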
src/nni_manager/training_service/common/util.ts
@@ -24,7 +24,10 @@ import { getLogger } from "common/log";
 import { countFilesRecursively } from '../../common/utils'
 import * as cpp from 'child-process-promise';
 import * as cp from 'child_process';
-import { GPU_INFO_COLLECTOR_FORMAT_LINUX, GPU_INFO_COLLECTOR_FORMAT_WINDOWS } from './gpuData'
+import * as os from 'os';
+import * as fs from 'fs';
+import { getNewLine } from '../../common/utils';
+import { GPU_INFO_COLLECTOR_FORMAT_LINUX, GPU_INFO_COLLECTOR_FORMAT_WINDOWS } from './gpuData';
 import * as path from 'path';
 import { String } from 'typescript-string-operations';
 import { file } from "../../node_modules/@types/tmp";
@@ -66,6 +69,20 @@ export async function execMkdir(directory: string): Promise<void> {
     return Promise.resolve();
 }
 
+/**
+ * copy files to the directory
+ * @param source
+ * @param destination
+ */
+export async function execCopydir(source: string, destination: string): Promise<void> {
+    if (process.platform === 'win32') {
+        await cpp.exec(`powershell.exe Copy-Item ${source} -Destination ${destination} -Recurse`);
+    } else {
+        await cpp.exec(`cp -r ${source} ${destination}`);
+    }
+    return Promise.resolve();
+}
+
 /**
  * crete a new file
  * @param filename
@@ -91,8 +108,6 @@ export function execScript(filePath: string): cp.ChildProcess {
     }
 }
 
-
-
 /**
  * output the last line of a file
  * @param filePath
@@ -111,9 +126,9 @@ export async function execTail(filePath: string): Promise<cpp.childProcessPromis
  * delete a directory
  * @param directory
  */
-export async function execRemove(directory: string): Promise<void>{
+export async function execRemove(directory: string): Promise<void> {
     if (process.platform === 'win32') {
-        await cpp.exec(`powershell.exe Remove-Item ${directory}`);
+        await cpp.exec(`powershell.exe Remove-Item ${directory} -Recurse -Force`);
     } else {
         await cpp.exec(`rm -rf ${directory}`);
     }
@@ -124,7 +139,7 @@ export async function execRemove(directory: string): Promise<void>{
  * kill a process
  * @param directory
  */
-export async function execKill(pid: string): Promise<void>{
+export async function execKill(pid: string): Promise<void> {
     if (process.platform === 'win32') {
         await cpp.exec(`cmd /c taskkill /PID ${pid} /T /F`);
     } else {
@@ -138,7 +153,7 @@ export async function execKill(pid: string): Promise<void>{
  * @param variable
  * @returns command string
  */
-export function setEnvironmentVariable(variable: { key: string; value: string }): string{
+export function setEnvironmentVariable(variable: { key: string; value: string }): string {
     if (process.platform === 'win32') {
         return `$env:${variable.key}="${variable.value}"`;
     }
@@ -147,6 +162,32 @@ export function setEnvironmentVariable(variable: { key: string; value: string })
     }
 }
 
+/**
+ * Compress files in directory to tar file
+ * @param source_path
+ * @param tar_path
+ */
+export async function tarAdd(tar_path: string, source_path: string): Promise<void> {
+    if (process.platform === 'win32') {
+        tar_path = tar_path.split('\\').join('\\\\');
+        source_path = source_path.split('\\').join('\\\\');
+        let script: string[] = [];
+        script.push(
+            `import os`,
+            `import tarfile`,
+            String.Format(`tar = tarfile.open("{0}","w:gz")\r\nfor root,dir,files in os.walk("{1}"):`, tar_path, source_path),
+            `    for file in files:`,
+            `        fullpath = os.path.join(root,file)`,
+            `        tar.add(fullpath, arcname=file)`,
+            `tar.close()`);
+        await fs.promises.writeFile(path.join(os.tmpdir(), 'tar.py'), script.join(getNewLine()), { encoding: 'utf8', mode: 0o777 });
+        const tarScript: string = path.join(os.tmpdir(), 'tar.py');
+        await cpp.exec(`python ${tarScript}`);
+    } else {
+        await cpp.exec(`tar -czf ${tar_path} -C ${source_path} .`);
+    }
+    return Promise.resolve();
+}
+
 /**
  * generate script file name
...
...
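`execCopydir` and `execRemove` choose a shell command by checking `process.platform`. One way to make that choice testable on any host is to build the command string in a pure function and run it separately. The sketch below mirrors the two copy commands from the diff; `buildCopyCommand` and its `platform` parameter are hypothetical and not part of this commit. Like the original, it does not quote its arguments, so paths containing spaces would need extra handling:

```typescript
// Hypothetical helper: returns the command that execCopydir would run,
// without running it. The platform defaults to the current host but can
// be overridden so both branches are exercised in a test.
function buildCopyCommand(
    source: string,
    destination: string,
    platform: string = process.platform
): string {
    if (platform === 'win32') {
        // PowerShell branch, same shape as the diff's Copy-Item call.
        return `powershell.exe Copy-Item ${source} -Destination ${destination} -Recurse`;
    }
    // POSIX branch, same shape as the diff's cp call.
    return `cp -r ${source} ${destination}`;
}
```

A caller could then pass the result to whatever exec function it already uses, keeping the platform branching in one place.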
src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts
@@ -36,7 +36,7 @@ import { ObservableTimer } from '../../common/observableTimer';
 import {
     HostJobApplicationForm, HyperParameters, JobApplicationForm, TrainingService, TrialJobApplicationForm, TrialJobDetail, TrialJobMetric, NNIManagerIpConfig
 } from '../../common/trainingService';
-import { delay, generateParamFileName, getExperimentRootDir, uniqueString, getJobCancelStatus, getRemoteTmpDir, getIPV4Address } from '../../common/utils';
+import { delay, generateParamFileName, getExperimentRootDir, uniqueString, getJobCancelStatus, getRemoteTmpDir, getIPV4Address, getVersion, unixPathJoin } from '../../common/utils';
 import { GPUSummary } from '../common/gpuData';
 import { TrialConfig } from '../common/trialConfig';
 import { TrialConfigMetadataKey } from '../common/trialConfigMetadataKey';
@@ -48,10 +48,9 @@ import {
 } from './remoteMachineData';
 import { GPU_INFO_COLLECTOR_FORMAT_LINUX } from '../common/gpuData';
 import { SSHClientUtility } from './sshClientUtility';
-import { validateCodeDir } from '../common/util';
+import { validateCodeDir, execRemove, execMkdir, execCopydir } from '../common/util';
 import { RemoteMachineJobRestServer } from './remoteMachineJobRestServer';
 import { CONTAINER_INSTALL_NNI_SHELL_FORMAT } from '../common/containerJobData';
-import { mkDirP, getVersion } from '../../common/utils';
 
 /**
  * Training Service implementation for Remote Machine (Linux)
@@ -234,7 +233,7 @@ class RemoteMachineTrainingService implements TrainingService {
         } else if (form.jobType === 'TRIAL') {
             // Generate trial job id(random)
             const trialJobId: string = uniqueString(5);
-            const trialWorkingFolder: string = path.join(this.remoteExpRootDir, 'trials', trialJobId);
+            const trialWorkingFolder: string = unixPathJoin(this.remoteExpRootDir, 'trials', trialJobId);
 
             const trialJobDetail: RemoteMachineTrialJobDetail = new RemoteMachineTrialJobDetail(
                 trialJobId,
@@ -354,7 +353,7 @@ class RemoteMachineTrainingService implements TrainingService {
             case TrialConfigMetadataKey.MACHINE_LIST:
                 await this.setupConnections(value);
                 //remove local temp files
-                await cpp.exec(`rm -rf ${this.getLocalGpuMetricCollectorDir()}`);
+                await execRemove(this.getLocalGpuMetricCollectorDir());
                 break;
             case TrialConfigMetadataKey.TRIAL_CONFIG:
                 const remoteMachineTrailConfig: TrialConfig = <TrialConfig>JSON.parse(value);
@@ -417,7 +416,7 @@ class RemoteMachineTrainingService implements TrainingService {
     private async cleanupConnections(): Promise<void> {
         try {
             for (const [rmMeta, sshClientManager] of this.machineSSHClientMap.entries()) {
-                let jobpidPath: string = path.join(this.getRemoteScriptsPath(rmMeta.username), 'pid');
+                let jobpidPath: string = unixPathJoin(this.getRemoteScriptsPath(rmMeta.username), 'pid');
                 let client: Client | undefined = sshClientManager.getFirstSSHClient();
                 if (client) {
                     await SSHClientUtility.remoteExeCommand(`pkill -P \`cat ${jobpidPath}\``, client);
@@ -438,7 +437,7 @@ class RemoteMachineTrainingService implements TrainingService {
      */
     private getLocalGpuMetricCollectorDir(): string {
         let userName: string = path.basename(os.homedir()); //get current user name of os
-        return `${os.tmpdir()}/${userName}/nni/scripts/`;
+        return path.join(os.tmpdir(), userName, 'nni', 'scripts');
     }
 
     /**
@@ -447,14 +446,14 @@ class RemoteMachineTrainingService implements TrainingService {
      */
     private async generateGpuMetricsCollectorScript(userName: string): Promise<void> {
         let gpuMetricCollectorScriptFolder: string = this.getLocalGpuMetricCollectorDir();
-        await cpp.exec(`mkdir -p ${path.join(gpuMetricCollectorScriptFolder, userName)}`);
+        await execMkdir(path.join(gpuMetricCollectorScriptFolder, userName));
         //generate gpu_metrics_collector.sh
         let gpuMetricsCollectorScriptPath: string = path.join(gpuMetricCollectorScriptFolder, userName, 'gpu_metrics_collector.sh');
         const remoteGPUScriptsDir: string = this.getRemoteScriptsPath(userName); // This directory is used to store gpu_metrics and pid created by script
         const gpuMetricsCollectorScriptContent: string = String.Format(
             GPU_INFO_COLLECTOR_FORMAT_LINUX,
             remoteGPUScriptsDir,
-            path.join(remoteGPUScriptsDir, 'pid'),
+            unixPathJoin(remoteGPUScriptsDir, 'pid'),
         );
         await fs.promises.writeFile(gpuMetricsCollectorScriptPath, gpuMetricsCollectorScriptContent, { encoding: 'utf8' });
     }
@@ -481,7 +480,7 @@ class RemoteMachineTrainingService implements TrainingService {
     private async initRemoteMachineOnConnected(rmMeta: RemoteMachineMeta, conn: Client): Promise<void> {
         // Create root working directory after ssh connection is ready
         await this.generateGpuMetricsCollectorScript(rmMeta.username); //generate gpu script in local machine first, will copy to remote machine later
-        const nniRootDir: string = `${os.tmpdir()}/nni`;
+        const nniRootDir: string = unixPathJoin(getRemoteTmpDir(this.remoteOS), 'nni');
         await SSHClientUtility.remoteExeCommand(`mkdir -p ${this.remoteExpRootDir}`, conn);
 
         // Copy NNI scripts to remote expeirment working directory
@@ -490,15 +489,15 @@ class RemoteMachineTrainingService implements TrainingService {
         await SSHClientUtility.remoteExeCommand(`mkdir -p ${remoteGpuScriptCollectorDir}`, conn);
         await SSHClientUtility.remoteExeCommand(`chmod 777 ${nniRootDir} ${nniRootDir}/* ${nniRootDir}/scripts/*`, conn);
         //copy gpu_metrics_collector.sh to remote
-        await SSHClientUtility.copyFileToRemote(path.join(localGpuScriptCollectorDir, rmMeta.username, 'gpu_metrics_collector.sh'), path.join(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh'), conn);
+        await SSHClientUtility.copyFileToRemote(path.join(localGpuScriptCollectorDir, rmMeta.username, 'gpu_metrics_collector.sh'), unixPathJoin(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh'), conn);
 
         //Begin to execute gpu_metrics_collection scripts
-        SSHClientUtility.remoteExeCommand(`bash ${path.join(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh')}`, conn);
+        SSHClientUtility.remoteExeCommand(`bash ${unixPathJoin(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh')}`, conn);
 
         this.timer.subscribe(
             async (tick: number) => {
                 const cmdresult: RemoteCommandResult = await SSHClientUtility.remoteExeCommand(
-                    `tail -n 1 ${path.join(remoteGpuScriptCollectorDir, 'gpu_metrics')}`, conn);
+                    `tail -n 1 ${unixPathJoin(remoteGpuScriptCollectorDir, 'gpu_metrics')}`, conn);
                 if (cmdresult && cmdresult.stdout) {
                     rmMeta.gpuSummary = <GPUSummary>JSON.parse(cmdresult.stdout);
                 }
@@ -531,7 +530,7 @@ class RemoteMachineTrainingService implements TrainingService {
         } else if (rmScheduleResult.resultType === ScheduleResultType.SUCCEED && rmScheduleResult.scheduleInfo !== undefined) {
             const rmScheduleInfo: RemoteMachineScheduleInfo = rmScheduleResult.scheduleInfo;
-            const trialWorkingFolder: string = path.join(this.remoteExpRootDir, 'trials', trialJobId);
+            const trialWorkingFolder: string = unixPathJoin(this.remoteExpRootDir, 'trials', trialJobId);
 
             trialJobDetail.rmMeta = rmScheduleInfo.rmMeta;
@@ -575,7 +574,7 @@ class RemoteMachineTrainingService implements TrainingService {
         const trialLocalTempFolder: string = path.join(this.expRootDir, 'trials-local', trialJobId);
 
         await SSHClientUtility.remoteExeCommand(`mkdir -p ${trialWorkingFolder}`, sshClient);
-        await SSHClientUtility.remoteExeCommand(`mkdir -p ${path.join(trialWorkingFolder, '.nni')}`, sshClient);
+        await SSHClientUtility.remoteExeCommand(`mkdir -p ${unixPathJoin(trialWorkingFolder, '.nni')}`, sshClient);
 
         // RemoteMachineRunShellFormat is the run shell format string,
         // See definition in remoteMachineData.ts
@@ -603,20 +602,20 @@ class RemoteMachineTrainingService implements TrainingService {
             getExperimentId(),
             trialJobDetail.sequenceId.toString(),
             this.isMultiPhase,
-            path.join(trialWorkingFolder, '.nni', 'jobpid'),
+            unixPathJoin(trialWorkingFolder, '.nni', 'jobpid'),
             command,
             nniManagerIp,
             this.remoteRestServerPort,
             version,
             this.logCollection,
-            path.join(trialWorkingFolder, '.nni', 'code')
+            unixPathJoin(trialWorkingFolder, '.nni', 'code')
         )
 
         //create tmp trial working folder locally.
-        await cpp.exec(`mkdir -p ${path.join(trialLocalTempFolder, '.nni')}`);
+        await execMkdir(path.join(trialLocalTempFolder, '.nni'));
 
         //create tmp trial working folder locally.
-        await cpp.exec(`cp -r ${this.trialConfig.codeDir}/* ${trialLocalTempFolder}`);
+        await execCopydir(path.join(this.trialConfig.codeDir, '*'), trialLocalTempFolder);
         const installScriptContent: string = CONTAINER_INSTALL_NNI_SHELL_FORMAT;
         // Write NNI installation file to local tmp files
         await fs.promises.writeFile(path.join(trialLocalTempFolder, 'install_nni.sh'), installScriptContent, { encoding: 'utf8' });
@@ -626,7 +625,7 @@ class RemoteMachineTrainingService implements TrainingService {
         // Copy files in codeDir to remote working directory
         await SSHClientUtility.copyDirectoryToRemote(trialLocalTempFolder, trialWorkingFolder, sshClient, this.remoteOS);
         // Execute command in remote machine
-        SSHClientUtility.remoteExeCommand(`bash ${path.join(trialWorkingFolder, 'run.sh')}`, sshClient);
+        SSHClientUtility.remoteExeCommand(`bash ${unixPathJoin(trialWorkingFolder, 'run.sh')}`, sshClient);
     }
 
     private async runHostJob(form: HostJobApplicationForm): Promise<TrialJobDetail> {
@@ -646,8 +645,8 @@ class RemoteMachineTrainingService implements TrainingService {
         );
         await fs.promises.writeFile(path.join(localDir, 'run.sh'), runScriptContent, { encoding: 'utf8' });
         await SSHClientUtility.copyFileToRemote(
-            path.join(localDir, 'run.sh'), path.join(remoteDir, 'run.sh'), sshClient);
-        SSHClientUtility.remoteExeCommand(`bash ${path.join(remoteDir, 'run.sh')}`, sshClient);
+            path.join(localDir, 'run.sh'), unixPathJoin(remoteDir, 'run.sh'), sshClient);
+        SSHClientUtility.remoteExeCommand(`bash ${unixPathJoin(remoteDir, 'run.sh')}`, sshClient);
 
         const jobDetail: RemoteMachineTrialJobDetail = new RemoteMachineTrialJobDetail(
             jobId, 'RUNNING', Date.now(), remoteDir, form, this.generateSequenceId()
@@ -672,7 +671,7 @@ class RemoteMachineTrainingService implements TrainingService {
     private async updateTrialJobStatus(trialJob: RemoteMachineTrialJobDetail, sshClient: Client): Promise<TrialJobDetail> {
         const deferred: Deferred<TrialJobDetail> = new Deferred<TrialJobDetail>();
         const jobpidPath: string = this.getJobPidPath(trialJob.id);
-        const trialReturnCodeFilePath: string = path.join(this.remoteExpRootDir, 'trials', trialJob.id, '.nni', 'code');
+        const trialReturnCodeFilePath: string = unixPathJoin(this.remoteExpRootDir, 'trials', trialJob.id, '.nni', 'code');
         try {
             const killResult: number = (await SSHClientUtility.remoteExeCommand(`kill -0 \`cat ${jobpidPath}\``, sshClient)).exitCode;
             // if the process of jobpid is not alive any more
@@ -712,15 +711,15 @@ class RemoteMachineTrainingService implements TrainingService {
     }
 
     private getRemoteScriptsPath(userName: string): string {
-        return path.join(getRemoteTmpDir(this.remoteOS), userName, 'nni', 'scripts');
+        return unixPathJoin(getRemoteTmpDir(this.remoteOS), userName, 'nni', 'scripts');
     }
 
     private getHostJobRemoteDir(jobId: string): string {
-        return path.join(this.remoteExpRootDir, 'hostjobs', jobId);
+        return unixPathJoin(this.remoteExpRootDir, 'hostjobs', jobId);
     }
 
     private getRemoteExperimentRootDir(): string {
-        return path.join(getRemoteTmpDir(this.remoteOS), 'nni', 'experiments', getExperimentId());
+        return unixPathJoin(getRemoteTmpDir(this.remoteOS), 'nni', 'experiments', getExperimentId());
     }
 
     public get MetricsEmitter(): EventEmitter {
@@ -735,9 +734,9 @@ class RemoteMachineTrainingService implements TrainingService {
         let jobpidPath: string;
 
         if (trialJobDetail.form.jobType === 'TRIAL') {
-            jobpidPath = path.join(trialJobDetail.workingDirectory, '.nni', 'jobpid');
+            jobpidPath = unixPathJoin(trialJobDetail.workingDirectory, '.nni', 'jobpid');
         } else if (trialJobDetail.form.jobType === 'HOST') {
-            jobpidPath = path.join(this.getHostJobRemoteDir(jobId), 'jobpid');
+            jobpidPath = unixPathJoin(this.getHostJobRemoteDir(jobId), 'jobpid');
         } else {
             throw new Error(`Job type not supported: ${trialJobDetail.form.jobType}`);
         }
@@ -751,14 +750,14 @@ class RemoteMachineTrainingService implements TrainingService {
             throw new Error('sshClient is undefined.');
         }
 
-        const trialWorkingFolder: string = path.join(this.remoteExpRootDir, 'trials', trialJobId);
+        const trialWorkingFolder: string = unixPathJoin(this.remoteExpRootDir, 'trials', trialJobId);
         const trialLocalTempFolder: string = path.join(this.expRootDir, 'trials-local', trialJobId);
 
         const fileName: string = generateParamFileName(hyperParameters);
         const localFilepath: string = path.join(trialLocalTempFolder, fileName);
         await fs.promises.writeFile(localFilepath, hyperParameters.value, { encoding: 'utf8' });
 
-        await SSHClientUtility.copyFileToRemote(localFilepath, path.join(trialWorkingFolder, fileName), sshClient);
+        await SSHClientUtility.copyFileToRemote(localFilepath, unixPathJoin(trialWorkingFolder, fileName), sshClient);
     }
 
     private generateSequenceId(): number {
...
...
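Most hunks in remoteMachineTrainingService.ts swap `path.join` for `unixPathJoin` on paths that live on the remote machine. `path.join` uses the separator of the host running nni_manager, so on a Windows host it yields backslash-separated paths, which a Linux remote would not interpret as intended. The definition of `unixPathJoin` is not shown on this page; the sketch below is one hypothetical way to get forward-slash joins with Node's `path.posix`, and both function names are illustrative only:

```typescript
import * as path from 'path';

// Hypothetical sketch, not the commit's unixPathJoin: always join with
// forward slashes, regardless of the host operating system.
function unixPathJoinSketch(...segments: string[]): string {
    return path.posix.join(...segments);
}

// For comparison: what the default join would produce on a Windows host.
// path.win32 is available on every platform, so this runs anywhere.
function windowsHostJoin(...segments: string[]): string {
    return path.win32.join(...segments);
}
```

Local paths (for example `trialLocalTempFolder`) keep using `path.join` in the diff, since they must match the host's own separator.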
src/nni_manager/training_service/remote_machine/sshClientUtility.ts
@@ -28,8 +28,9 @@ import * as stream from 'stream';
 import { Deferred } from 'ts-deferred';
 import { NNIError, NNIErrorNames } from '../../common/errors';
 import { getLogger, Logger } from '../../common/log';
-import { uniqueString, getRemoteTmpDir } from '../../common/utils';
+import { uniqueString, getRemoteTmpDir, unixPathJoin } from '../../common/utils';
 import { RemoteCommandResult } from './remoteMachineData';
+import { execRemove, tarAdd } from '../common/util';
 
 /**
  *
@@ -47,13 +48,13 @@ export namespace SSHClientUtility {
         const deferred: Deferred<void> = new Deferred<void>();
         const tmpTarName: string = `${uniqueString(10)}.tar.gz`;
         const localTarPath: string = path.join(os.tmpdir(), tmpTarName);
-        const remoteTarPath: string = path.join(getRemoteTmpDir(remoteOS), tmpTarName);
+        const remoteTarPath: string = unixPathJoin(getRemoteTmpDir(remoteOS), tmpTarName);
 
         // Compress files in local directory to experiment root directory
-        await cpp.exec(`tar -czf ${localTarPath} -C ${localDirectory} .`);
+        await tarAdd(localTarPath, localDirectory);
         // Copy the compressed file to remoteDirectory and delete it
         await copyFileToRemote(localTarPath, remoteTarPath, sshClient);
-        await cpp.exec(`rm ${localTarPath}`);
+        await execRemove(localTarPath);
         // Decompress the remote compressed file in and delete it
         await remoteExeCommand(`tar -oxzf ${remoteTarPath} -C ${remoteDirectory}`, sshClient);
         await remoteExeCommand(`rm ${remoteTarPath}`, sshClient);
...
...