OpenDAS / nni
Commit 3d2abd4a
Authored Aug 26, 2020 by SparkSnail
Committed by GitHub, Aug 26, 2020

Fix remote & kubeflow it (#2828)

Parent: 9f44d54a
Showing 2 changed files with 14 additions and 13 deletions

src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts  +13 -12
test/config/training_service.yml  +1 -1
--- a/src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts
+++ b/src/nni_manager/training_service/remote_machine/remoteMachineTrainingService.ts
@@ -89,6 +89,19 @@ class RemoteMachineTrainingService implements TrainingService {
             this.sshConnectionPromises = [];
             // initialize gpuScheduler
             this.gpuScheduler = new GPUScheduler(this.machineExecutorManagerMap);
+            if (this.trialConfig === undefined) {
+                throw new Error("trial config not initialized!");
+            }
+            // Copy codeDir to remote machine
+            for (const [rmMeta, executorManager] of this.machineExecutorManagerMap.entries()) {
+                const executor: ShellExecutor = await executorManager.getExecutor(this.initExecutorId);
+                if (executor !== undefined) {
+                    this.machineCopyExpCodeDirPromiseMap.set(
+                        rmMeta,
+                        executor.copyDirectoryToRemote(this.trialConfig.codeDir, executor.getRemoteCodePath(getExperimentId()))
+                    );
+                }
+            }
         }
         while (!this.stopping) {
             while (this.jobQueue.length > 0) {
@@ -328,20 +341,8 @@ class RemoteMachineTrainingService implements TrainingService {
                 try {
                     // Validate to make sure codeDir doesn't have too many files
                     await validateCodeDir(remoteMachineTrailConfig.codeDir);
-                    // Copy codeDir to remote machine
-                    for (const [rmMeta, executorManager] of this.machineExecutorManagerMap.entries()) {
-                        const executor: ShellExecutor = await executorManager.getExecutor(this.initExecutorId);
-                        if (executor !== undefined) {
-                            this.machineCopyExpCodeDirPromiseMap.set(
-                                rmMeta,
-                                executor.copyDirectoryToRemote(remoteMachineTrailConfig.codeDir, executor.getRemoteCodePath(getExperimentId()))
-                            );
-                        }
-                    }
                 } catch (error) {
                     this.log.error(error);
                     return Promise.reject(new Error(error));
                 }
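
The two hunks above move the "Copy codeDir to remote machine" loop out of the trial-config handling and into the start-up path, guarded by a check that the trial config exists. The pattern can be sketched in isolation as follows. This is a minimal illustration, not the NNI implementation: `Executor`, `RecordingExecutor`, `MiniTrainingService`, `setTrialConfig`, `start`, `waitForCode` and the `/tmp/...` remote path are all hypothetical names chosen for the example.

```typescript
// Hypothetical stand-in for a shell executor; this is NOT the NNI ShellExecutor API.
interface Executor {
    copyDirectoryToRemote(localDir: string, remoteDir: string): Promise<void>;
}

// Test double that records each requested copy instead of using SSH.
class RecordingExecutor implements Executor {
    public readonly copies: string[] = [];

    public async copyDirectoryToRemote(localDir: string, remoteDir: string): Promise<void> {
        this.copies.push(`${localDir} -> ${remoteDir}`);
    }
}

class MiniTrainingService {
    private trialConfig: { codeDir: string } | undefined;
    private readonly executors: Map<string, Executor>;
    // machine name -> in-flight copy of the experiment code directory
    private readonly copyPromises: Map<string, Promise<void>> = new Map<string, Promise<void>>();

    constructor(executors: Map<string, Executor>) {
        this.executors = executors;
    }

    // Only records the config; no remote work is started here.
    public setTrialConfig(codeDir: string): void {
        this.trialConfig = { codeDir };
    }

    // Start-up: fail fast if the config is missing, then begin one copy per machine.
    public start(experimentId: string): void {
        if (this.trialConfig === undefined) {
            throw new Error('trial config not initialized!');
        }
        const codeDir: string = this.trialConfig.codeDir;
        this.executors.forEach((executor: Executor, machine: string) => {
            // The remote path is an assumption made for this sketch.
            this.copyPromises.set(machine, executor.copyDirectoryToRemote(codeDir, `/tmp/${experimentId}/code`));
        });
    }

    // Before launching a trial on a machine, wait for that machine's copy to finish.
    public async waitForCode(machine: string): Promise<void> {
        const pending = this.copyPromises.get(machine);
        if (pending === undefined) {
            throw new Error(`no code copy was started for ${machine}`);
        }
        await pending;
    }
}
```

Storing the promise rather than awaiting it lets start-up proceed for every machine while the copies run, and a trial only blocks on the copy for the machine it is scheduled on.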
--- a/test/config/training_service.yml
+++ b/test/config/training_service.yml
@@ -10,7 +10,7 @@ kubeflow:
   kubeflowConfig:
     operator: tf-operator
-    apiVersion: v1alpha2
+    apiVersion: v1
     storage: azureStorage
     keyVault:
       vaultName:
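
To see why the `apiVersion` value matters, recall that Kubernetes serves namespaced custom resources under `/apis/<group>/<version>/namespaces/<namespace>/<plural>`. The sketch below assumes the configured value is used as the `<version>` segment for TFJob resources in the `kubeflow.org` group; `tfJobCollectionPath` is a hypothetical helper written for this example, not a function from the repository.

```typescript
// Hypothetical helper: builds the REST path for the TFJob collection in a namespace.
// The group ("kubeflow.org") and plural ("tfjobs") follow the Kubernetes
// custom-resource URL layout; how NNI itself uses apiVersion is an assumption here.
function tfJobCollectionPath(apiVersion: string, namespace: string): string {
    return `/apis/kubeflow.org/${apiVersion}/namespaces/${namespace}/tfjobs`;
}

// The test config before and after this commit would resolve to different endpoints.
const oldPath: string = tfJobCollectionPath('v1alpha2', 'default');
const newPath: string = tfJobCollectionPath('v1', 'default');
```

If the cluster's operator no longer serves the older version, requests against `oldPath` would fail while `newPath` would succeed, which is one plausible reason for a test config to be bumped.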