Skip to content
GitLab
Menu
Projects
Groups
Snippets
Loading...
Help
Help
Support
Community forum
Keyboard shortcuts
?
Submit feedback
Contribute to GitLab
Sign in / Register
Toggle navigation
Menu
Open sidebar
OpenDAS
nni
Commits
ba8dccd6
Commit
ba8dccd6
authored
Jun 23, 2019
by
suiguoxin
Browse files
Merge branch 'master' of
https://github.com/microsoft/nni
parents
56a1575b
150ee83a
Changes
208
Hide whitespace changes
Inline
Side-by-side
Showing
20 changed files
with
147 additions
and
107 deletions
+147
-107
examples/tuners/ga_customer_tuner/README.md
examples/tuners/ga_customer_tuner/README.md
+2
-2
examples/tuners/weight_sharing/ga_customer_tuner/README.md
examples/tuners/weight_sharing/ga_customer_tuner/README.md
+2
-2
src/nni_manager/common/restServer.ts
src/nni_manager/common/restServer.ts
+5
-5
src/nni_manager/common/trainingService.ts
src/nni_manager/common/trainingService.ts
+1
-2
src/nni_manager/common/utils.ts
src/nni_manager/common/utils.ts
+41
-7
src/nni_manager/config/frameworkcontroller/frameworkcontrollerjob-crd-v1.json
...ig/frameworkcontroller/frameworkcontrollerjob-crd-v1.json
+8
-8
src/nni_manager/config/kubeflow/pytorchjob-crd-v1alpha2.json
src/nni_manager/config/kubeflow/pytorchjob-crd-v1alpha2.json
+8
-8
src/nni_manager/config/kubeflow/pytorchjob-crd-v1beta1.json
src/nni_manager/config/kubeflow/pytorchjob-crd-v1beta1.json
+8
-8
src/nni_manager/config/kubeflow/tfjob-crd-v1alpha2.json
src/nni_manager/config/kubeflow/tfjob-crd-v1alpha2.json
+8
-8
src/nni_manager/config/kubeflow/tfjob-crd-v1beta1.json
src/nni_manager/config/kubeflow/tfjob-crd-v1beta1.json
+8
-8
src/nni_manager/core/nnimanager.ts
src/nni_manager/core/nnimanager.ts
+2
-2
src/nni_manager/core/test/ipcInterface.test.ts
src/nni_manager/core/test/ipcInterface.test.ts
+3
-3
src/nni_manager/core/test/mockedTrainingService.ts
src/nni_manager/core/test/mockedTrainingService.ts
+2
-2
src/nni_manager/core/test/nnimanager.test.ts
src/nni_manager/core/test/nnimanager.test.ts
+1
-1
src/nni_manager/package.json
src/nni_manager/package.json
+1
-0
src/nni_manager/rest_server/nniRestServer.ts
src/nni_manager/rest_server/nniRestServer.ts
+1
-1
src/nni_manager/rest_server/restHandler.ts
src/nni_manager/rest_server/restHandler.ts
+1
-1
src/nni_manager/rest_server/restValidationSchemas.ts
src/nni_manager/rest_server/restValidationSchemas.ts
+1
-1
src/nni_manager/training_service/common/clusterJobRestServer.ts
...i_manager/training_service/common/clusterJobRestServer.ts
+41
-35
src/nni_manager/training_service/common/containerJobData.ts
src/nni_manager/training_service/common/containerJobData.ts
+3
-3
No files found.
examples/tuners/ga_customer_tuner/README.md
View file @
ba8dccd6
# How to use ga_customer_tuner?
This tuner is a customized tuner which only suitable for trial whose code path is "~/nni/examples/trials/ga_squad",
This tuner is a customized tuner which only suitable for trial whose code path is "~/nni/examples/trials/ga_squad",
type
`cd ~/nni/examples/trials/ga_squad`
and check readme.md to get more information for ga_squad trial.
# config
# config
If you want to use ga_customer_tuner in your experiment, you could set config file as following format:
```
...
...
examples/tuners/weight_sharing/ga_customer_tuner/README.md
View file @
ba8dccd6
# How to use ga_customer_tuner?
This tuner is a customized tuner which only suitable for trial whose code path is "~/nni/examples/trials/ga_squad",
This tuner is a customized tuner which only suitable for trial whose code path is "~/nni/examples/trials/ga_squad",
type
`cd ~/nni/examples/trials/ga_squad`
and check readme.md to get more information for ga_squad trial.
# config
# config
If you want to use ga_customer_tuner in your experiment, you could set config file as following format:
```
...
...
src/nni_manager/common/restServer.ts
View file @
ba8dccd6
...
...
@@ -29,7 +29,7 @@ import { getBasePort } from './experimentStartupInfo';
/**
* Abstraction class to create a RestServer
* The module who wants to use a RestServer could <b>extends</b> this abstract class
* The module who wants to use a RestServer could <b>extends</b> this abstract class
* And implement its own registerRestHandler() function to register routers
*/
export
abstract
class
RestServer
{
...
...
@@ -43,7 +43,7 @@ export abstract class RestServer {
protected
app
:
express
.
Application
=
express
();
protected
log
:
Logger
=
getLogger
();
protected
basePort
?:
number
;
constructor
()
{
this
.
port
=
getBasePort
();
assert
(
this
.
port
&&
this
.
port
>
1024
);
...
...
@@ -91,9 +91,9 @@ export abstract class RestServer {
}
else
{
this
.
startTask
.
promise
.
then
(
()
=>
{
// Started
//Stops the server from accepting new connections and keeps existing connections.
//This function is asynchronous, the server is finally closed when all connections
//are ended and the server emits a 'close' event.
//Stops the server from accepting new connections and keeps existing connections.
//This function is asynchronous, the server is finally closed when all connections
//are ended and the server emits a 'close' event.
//Refer https://nodejs.org/docs/latest/api/net.html#net_server_close_callback
this
.
server
.
close
().
on
(
'
close
'
,
()
=>
{
this
.
log
.
info
(
'
Rest server stopped.
'
);
...
...
src/nni_manager/common/trainingService.ts
View file @
ba8dccd6
...
...
@@ -91,6 +91,7 @@ interface TrialJobMetric {
* define TrainingServiceError
*/
class
TrainingServiceError
extends
Error
{
private
errCode
:
number
;
constructor
(
errorCode
:
number
,
errorMessage
:
string
)
{
...
...
@@ -136,5 +137,3 @@ export {
TrainingServiceMetadata
,
TrialJobDetail
,
TrialJobMetric
,
HyperParameters
,
HostJobApplicationForm
,
JobApplicationForm
,
JobType
,
NNIManagerIpConfig
};
src/nni_manager/common/utils.ts
View file @
ba8dccd6
...
...
@@ -167,7 +167,7 @@ function getCmdPy(): string {
}
/**
* Generate command line to start automl algorithm(s),
* Generate command line to start automl algorithm(s),
* either start advisor or start a process which runs tuner and assessor
* @param tuner : For builtin tuner:
* {
...
...
@@ -361,11 +361,11 @@ function countFilesRecursively(directory: string, timeoutMilliSeconds?: number):
if
(
process
.
platform
===
"
win32
"
)
{
cmd
=
`powershell "Get-ChildItem -Path
${
directory
}
-Recurse -File | Measure-Object | %{$_.Count}"`
}
else
{
cmd
=
`find
${
directory
}
-type f | wc -l`
;
cmd
=
`find
${
directory
}
-type f | wc -l`
;
}
cpp
.
exec
(
cmd
).
then
((
result
)
=>
{
if
(
result
.
stdout
&&
parseInt
(
result
.
stdout
))
{
fileCount
=
parseInt
(
result
.
stdout
);
fileCount
=
parseInt
(
result
.
stdout
);
}
deferred
.
resolve
(
fileCount
);
});
...
...
@@ -374,6 +374,40 @@ function countFilesRecursively(directory: string, timeoutMilliSeconds?: number):
});
}
function
validateFileName
(
fileName
:
string
):
boolean
{
let
pattern
:
string
=
'
^[a-z0-9A-Z
\
.-_]+$
'
;
const
validateResult
=
fileName
.
match
(
pattern
);
if
(
validateResult
)
{
return
true
;
}
return
false
;
}
async
function
validateFileNameRecursively
(
directory
:
string
):
Promise
<
boolean
>
{
if
(
!
fs
.
existsSync
(
directory
))
{
throw
Error
(
`Direcotory
${
directory
}
doesn't exist`
);
}
const
fileNameArray
:
string
[]
=
fs
.
readdirSync
(
directory
);
let
result
=
true
;
for
(
var
name
of
fileNameArray
){
const
fullFilePath
:
string
=
path
.
join
(
directory
,
name
);
try
{
// validate file names and directory names
result
=
validateFileName
(
name
);
if
(
fs
.
lstatSync
(
fullFilePath
).
isDirectory
())
{
result
=
result
&&
await
validateFileNameRecursively
(
fullFilePath
);
}
if
(
!
result
)
{
return
Promise
.
reject
(
new
Error
(
`file name in
${
fullFilePath
}
is not valid!`
));
}
}
catch
(
error
)
{
return
Promise
.
reject
(
error
);
}
}
return
Promise
.
resolve
(
result
);
}
/**
* get the version of current package
*/
...
...
@@ -385,7 +419,7 @@ async function getVersion(): Promise<string> {
deferred
.
reject
(
error
);
});
return
deferred
.
promise
;
}
}
/**
* run command as ChildProcess
...
...
@@ -437,7 +471,7 @@ async function isAlive(pid:any): Promise<boolean> {
}
/**
* kill process
* kill process
*/
async
function
killPid
(
pid
:
any
):
Promise
<
void
>
{
let
deferred
:
Deferred
<
void
>
=
new
Deferred
<
void
>
();
...
...
@@ -466,7 +500,7 @@ function getNewLine(): string {
/**
* Use '/' to join path instead of '\' for all kinds of platform
* @param path
* @param path
*/
function
unixPathJoin
(...
paths
:
any
[]):
string
{
const
dir
:
string
=
paths
.
filter
((
path
:
any
)
=>
path
!==
''
).
join
(
'
/
'
);
...
...
@@ -474,6 +508,6 @@ function unixPathJoin(...paths: any[]): string {
return
dir
;
}
export
{
countFilesRecursively
,
getRemoteTmpDir
,
generateParamFileName
,
getMsgDispatcherCommand
,
getCheckpointDir
,
export
{
countFilesRecursively
,
validateFileNameRecursively
,
getRemoteTmpDir
,
generateParamFileName
,
getMsgDispatcherCommand
,
getCheckpointDir
,
getLogDir
,
getExperimentRootDir
,
getJobCancelStatus
,
getDefaultDatabaseDir
,
getIPV4Address
,
unixPathJoin
,
mkDirP
,
delay
,
prepareUnitTest
,
parseArg
,
cleanupUnitTest
,
uniqueString
,
randomSelect
,
getLogLevel
,
getVersion
,
getCmdPy
,
getTunerProc
,
isAlive
,
killPid
,
getNewLine
};
src/nni_manager/config/frameworkcontroller/frameworkcontrollerjob-crd-v1.json
View file @
ba8dccd6
{
"kind"
:
"CustomResourceDefinition"
,
"kind"
:
"CustomResourceDefinition"
,
"spec"
:
{
"scope"
:
"Namespaced"
,
"version"
:
"v1"
,
"group"
:
"frameworkcontroller.microsoft.com"
,
"scope"
:
"Namespaced"
,
"version"
:
"v1"
,
"group"
:
"frameworkcontroller.microsoft.com"
,
"names"
:
{
"kind"
:
"Framework"
,
"plural"
:
"frameworks"
,
"kind"
:
"Framework"
,
"plural"
:
"frameworks"
,
"singular"
:
"framework"
}
},
"apiVersion"
:
"apiextensions.k8s.io/v1beta1"
,
},
"apiVersion"
:
"apiextensions.k8s.io/v1beta1"
,
"metadata"
:
{
"name"
:
"frameworks.frameworkcontroller.microsoft.com"
}
...
...
src/nni_manager/config/kubeflow/pytorchjob-crd-v1alpha2.json
View file @
ba8dccd6
{
"kind"
:
"CustomResourceDefinition"
,
"kind"
:
"CustomResourceDefinition"
,
"spec"
:
{
"scope"
:
"Namespaced"
,
"version"
:
"v1alpha2"
,
"group"
:
"kubeflow.org"
,
"scope"
:
"Namespaced"
,
"version"
:
"v1alpha2"
,
"group"
:
"kubeflow.org"
,
"names"
:
{
"kind"
:
"PyTorchJob"
,
"plural"
:
"pytorchjobs"
,
"kind"
:
"PyTorchJob"
,
"plural"
:
"pytorchjobs"
,
"singular"
:
"pytorchjob"
}
},
"apiVersion"
:
"apiextensions.k8s.io/v1beta1"
,
},
"apiVersion"
:
"apiextensions.k8s.io/v1beta1"
,
"metadata"
:
{
"name"
:
"pytorchjobs.kubeflow.org"
}
...
...
src/nni_manager/config/kubeflow/pytorchjob-crd-v1beta1.json
View file @
ba8dccd6
{
"kind"
:
"CustomResourceDefinition"
,
"kind"
:
"CustomResourceDefinition"
,
"spec"
:
{
"scope"
:
"Namespaced"
,
"version"
:
"v1beta1"
,
"group"
:
"kubeflow.org"
,
"scope"
:
"Namespaced"
,
"version"
:
"v1beta1"
,
"group"
:
"kubeflow.org"
,
"names"
:
{
"kind"
:
"PyTorchJob"
,
"plural"
:
"pytorchjobs"
,
"kind"
:
"PyTorchJob"
,
"plural"
:
"pytorchjobs"
,
"singular"
:
"pytorchjob"
}
},
"apiVersion"
:
"apiextensions.k8s.io/v1beta1"
,
},
"apiVersion"
:
"apiextensions.k8s.io/v1beta1"
,
"metadata"
:
{
"name"
:
"pytorchjobs.kubeflow.org"
}
...
...
src/nni_manager/config/kubeflow/tfjob-crd-v1alpha2.json
View file @
ba8dccd6
{
"kind"
:
"CustomResourceDefinition"
,
"kind"
:
"CustomResourceDefinition"
,
"spec"
:
{
"scope"
:
"Namespaced"
,
"version"
:
"v1alpha2"
,
"group"
:
"kubeflow.org"
,
"scope"
:
"Namespaced"
,
"version"
:
"v1alpha2"
,
"group"
:
"kubeflow.org"
,
"names"
:
{
"kind"
:
"TFJob"
,
"plural"
:
"tfjobs"
,
"kind"
:
"TFJob"
,
"plural"
:
"tfjobs"
,
"singular"
:
"tfjob"
}
},
"apiVersion"
:
"apiextensions.k8s.io/v1beta1"
,
},
"apiVersion"
:
"apiextensions.k8s.io/v1beta1"
,
"metadata"
:
{
"name"
:
"tfjobs.kubeflow.org"
}
...
...
src/nni_manager/config/kubeflow/tfjob-crd-v1beta1.json
View file @
ba8dccd6
{
"kind"
:
"CustomResourceDefinition"
,
"kind"
:
"CustomResourceDefinition"
,
"spec"
:
{
"scope"
:
"Namespaced"
,
"version"
:
"v1beta1"
,
"group"
:
"kubeflow.org"
,
"scope"
:
"Namespaced"
,
"version"
:
"v1beta1"
,
"group"
:
"kubeflow.org"
,
"names"
:
{
"kind"
:
"TFJob"
,
"plural"
:
"tfjobs"
,
"kind"
:
"TFJob"
,
"plural"
:
"tfjobs"
,
"singular"
:
"tfjob"
}
},
"apiVersion"
:
"apiextensions.k8s.io/v1beta1"
,
},
"apiVersion"
:
"apiextensions.k8s.io/v1beta1"
,
"metadata"
:
{
"name"
:
"tfjobs.kubeflow.org"
}
...
...
src/nni_manager/core/nnimanager.ts
View file @
ba8dccd6
...
...
@@ -159,7 +159,7 @@ class NNIManager implements Manager {
if
(
expParams
.
logCollection
!==
undefined
)
{
this
.
trainingService
.
setClusterMetadata
(
'
log_collection
'
,
expParams
.
logCollection
.
toString
());
}
const
dispatcherCommand
:
string
=
getMsgDispatcherCommand
(
expParams
.
tuner
,
expParams
.
assessor
,
expParams
.
advisor
,
expParams
.
multiPhase
,
expParams
.
multiThread
);
this
.
log
.
debug
(
`dispatcher command:
${
dispatcherCommand
}
`
);
...
...
@@ -493,7 +493,7 @@ class NNIManager implements Manager {
// If trialConcurrency does not change, requestTrialNum equals finishedTrialJobNum.
// If trialConcurrency changes, for example, trialConcurrency increases by 2 (trialConcurrencyChange=2), then
// requestTrialNum equals 2 + finishedTrialJobNum and trialConcurrencyChange becomes 0.
// If trialConcurrency changes, for example, trialConcurrency decreases by 4 (trialConcurrencyChange=-4) and
// If trialConcurrency changes, for example, trialConcurrency decreases by 4 (trialConcurrencyChange=-4) and
// finishedTrialJobNum is 2, then requestTrialNum becomes -2. No trial will be requested from tuner,
// and trialConcurrencyChange becomes -2.
const
requestTrialNum
:
number
=
this
.
trialConcurrencyChange
+
finishedTrialJobNum
;
...
...
src/nni_manager/core/test/ipcInterface.test.ts
View file @
ba8dccd6
...
...
@@ -46,11 +46,11 @@ function runProcess(): Promise<Error | null> {
if
(
code
!==
0
)
{
deferred
.
resolve
(
new
Error
(
`return code:
${
code
}
`
));
}
else
{
let
str
=
proc
.
stdout
.
read
().
toString
();
let
str
=
proc
.
stdout
.
read
().
toString
();
if
(
str
.
search
(
"
\r\n
"
)
!=-
1
){
sentCommands
=
str
.
split
(
"
\r\n
"
);
}
else
{
else
{
sentCommands
=
str
.
split
(
'
\n
'
);
}
deferred
.
resolve
(
null
);
...
...
@@ -76,7 +76,7 @@ function runProcess(): Promise<Error | null> {
commandTooLong
=
error
;
}
// Command #4: FE is not tuner/assessor command, test the exception type of send non-valid command
// Command #4: FE is not tuner/assessor command, test the exception type of send non-valid command
try
{
dispatcher
.
sendCommand
(
'
FE
'
,
'
1
'
);
}
catch
(
error
)
{
...
...
src/nni_manager/core/test/mockedTrainingService.ts
View file @
ba8dccd6
...
...
@@ -59,10 +59,10 @@ class MockedTrainingService extends TrainingService {
},
sequenceId
:
0
};
public
listTrialJobs
():
Promise
<
TrialJobDetail
[]
>
{
const
deferred
=
new
Deferred
<
TrialJobDetail
[]
>
();
deferred
.
resolve
([
this
.
jobDetail1
,
this
.
jobDetail2
]);
return
deferred
.
promise
;
}
...
...
src/nni_manager/core/test/nnimanager.test.ts
View file @
ba8dccd6
...
...
@@ -104,7 +104,7 @@ describe('Unit test for nnimanager', function () {
maxSequenceId
:
0
,
revision
:
0
}
before
(
async
()
=>
{
await
initContainer
();
...
...
src/nni_manager/package.json
View file @
ba8dccd6
...
...
@@ -33,6 +33,7 @@
"@types/chai-as-promised"
:
"^7.1.0"
,
"@types/express"
:
"^4.16.0"
,
"@types/glob"
:
"^7.1.1"
,
"@types/js-base64"
:
"^2.3.1"
,
"@types/mocha"
:
"^5.2.5"
,
"@types/node"
:
"10.12.18"
,
"@types/request"
:
"^2.47.1"
,
...
...
src/nni_manager/rest_server/nniRestServer.ts
View file @
ba8dccd6
...
...
@@ -31,7 +31,7 @@ import { createRestHandler } from './restHandler';
* NNI Main rest server, provides rest API to support
* # nnictl CLI tool
* # NNI WebUI
*
*
*/
@
component
.
Singleton
export
class
NNIRestServer
extends
RestServer
{
...
...
src/nni_manager/rest_server/restHandler.ts
View file @
ba8dccd6
...
...
@@ -146,7 +146,7 @@ class NNIRestHandler {
});
});
}
private
importData
(
router
:
Router
):
void
{
router
.
post
(
'
/experiment/import-data
'
,
(
req
:
Request
,
res
:
Response
)
=>
{
this
.
nniManager
.
importData
(
JSON
.
stringify
(
req
.
body
)).
then
(()
=>
{
...
...
src/nni_manager/rest_server/restValidationSchemas.ts
View file @
ba8dccd6
...
...
@@ -133,7 +133,7 @@ export namespace ValidationSchemas {
})
}),
nni_manager_ip
:
joi
.
object
({
nniManagerIp
:
joi
.
string
().
min
(
1
)
nniManagerIp
:
joi
.
string
().
min
(
1
)
})
}
};
...
...
src/nni_manager/training_service/common/clusterJobRestServer.ts
View file @
ba8dccd6
...
...
@@ -20,22 +20,24 @@
'
use strict
'
;
import
*
as
assert
from
'
assert
'
;
import
{
Request
,
Response
,
Router
}
from
'
express
'
;
// tslint:disable-next-line:no-implicit-dependencies
import
*
as
bodyParser
from
'
body-parser
'
;
import
{
Request
,
Response
,
Router
}
from
'
express
'
;
import
*
as
fs
from
'
fs
'
;
import
*
as
path
from
'
path
'
;
import
{
Writable
}
from
'
stream
'
;
import
{
String
}
from
'
typescript-string-operations
'
;
import
*
as
component
from
'
../../common/component
'
;
import
*
as
fs
from
'
fs
'
import
*
as
path
from
'
path
'
import
{
getBasePort
,
getExperimentId
}
from
'
../../common/experimentStartupInfo
'
;
import
{
RestServer
}
from
'
../../common/restServer
'
import
{
RestServer
}
from
'
../../common/restServer
'
;
import
{
getLogDir
}
from
'
../../common/utils
'
;
import
{
Writable
}
from
'
stream
'
;
/**
* Cluster Job Training service Rest server, provides rest API to support Cluster job metrics update
*
*
*/
@
component
.
Singleton
export
abstract
class
ClusterJobRestServer
extends
RestServer
{
export
abstract
class
ClusterJobRestServer
extends
RestServer
{
private
readonly
API_ROOT_URL
:
string
=
'
/api/v1/nni-pai
'
;
private
readonly
NNI_METRICS_PATTERN
:
string
=
`NNISDK_MEb'(?<metrics>.*?)'`
;
...
...
@@ -51,22 +53,23 @@ export abstract class ClusterJobRestServer extends RestServer{
constructor
()
{
super
();
const
basePort
:
number
=
getBasePort
();
assert
(
basePort
&&
basePort
>
1024
);
this
.
port
=
basePort
+
1
;
assert
(
basePort
!==
undefined
&&
basePort
>
1024
);
this
.
port
=
basePort
+
1
;
}
public
get
clusterRestServerPort
():
number
{
if
(
!
this
.
port
)
{
if
(
this
.
port
===
undefined
)
{
throw
new
Error
(
'
PAI Rest server port is undefined
'
);
}
return
this
.
port
;
}
public
get
getErrorMessage
():
string
|
undefined
{
public
get
getErrorMessage
():
string
|
undefined
{
return
this
.
errorMessage
;
}
public
set
setEnableVersionCheck
(
versionCheck
:
boolean
)
{
this
.
enableVersionCheck
=
versionCheck
;
}
...
...
@@ -79,11 +82,15 @@ export abstract class ClusterJobRestServer extends RestServer{
this
.
app
.
use
(
this
.
API_ROOT_URL
,
this
.
createRestHandler
());
}
// Abstract method to handle trial metrics data
// tslint:disable-next-line:no-any
protected
abstract
handleTrialMetrics
(
jobId
:
string
,
trialMetrics
:
any
[])
:
void
;
// tslint:disable: no-unsafe-any no-any
private
createRestHandler
()
:
Router
{
const
router
:
Router
=
Router
();
// tslint:disable-next-line:typedef
router
.
use
((
req
:
Request
,
res
:
Response
,
next
)
=>
{
router
.
use
((
req
:
Request
,
res
:
Response
,
next
:
any
)
=>
{
this
.
log
.
info
(
`
${
req
.
method
}
:
${
req
.
url
}
: body:\n
${
JSON
.
stringify
(
req
.
body
,
undefined
,
4
)}
`
);
res
.
setHeader
(
'
Content-Type
'
,
'
application/json
'
);
next
();
...
...
@@ -92,7 +99,7 @@ export abstract class ClusterJobRestServer extends RestServer{
router
.
post
(
`/version/
${
this
.
expId
}
/:trialId`
,
(
req
:
Request
,
res
:
Response
)
=>
{
if
(
this
.
enableVersionCheck
)
{
try
{
const
checkResultSuccess
:
boolean
=
req
.
body
.
tag
===
'
VCSuccess
'
?
true
:
false
;
const
checkResultSuccess
:
boolean
=
req
.
body
.
tag
===
'
VCSuccess
'
?
true
:
false
;
if
(
this
.
versionCheckSuccess
!==
undefined
&&
this
.
versionCheckSuccess
!==
checkResultSuccess
)
{
this
.
errorMessage
=
'
Version check error, version check result is inconsistent!
'
;
this
.
log
.
error
(
this
.
errorMessage
);
...
...
@@ -103,7 +110,7 @@ export abstract class ClusterJobRestServer extends RestServer{
this
.
versionCheckSuccess
=
false
;
this
.
errorMessage
=
req
.
body
.
msg
;
}
}
catch
(
err
)
{
}
catch
(
err
)
{
this
.
log
.
error
(
`json parse metrics error:
${
err
}
`
);
res
.
status
(
500
);
res
.
send
(
err
.
message
);
...
...
@@ -122,8 +129,7 @@ export abstract class ClusterJobRestServer extends RestServer{
this
.
handleTrialMetrics
(
req
.
body
.
jobId
,
req
.
body
.
metrics
);
res
.
send
();
}
catch
(
err
)
{
}
catch
(
err
)
{
this
.
log
.
error
(
`json parse metrics error:
${
err
}
`
);
res
.
status
(
500
);
res
.
send
(
err
.
message
);
...
...
@@ -131,35 +137,37 @@ export abstract class ClusterJobRestServer extends RestServer{
});
router
.
post
(
`/stdout/
${
this
.
expId
}
/:trialId`
,
(
req
:
Request
,
res
:
Response
)
=>
{
if
(
this
.
enableVersionCheck
&&
!
this
.
versionCheckSuccess
&&
!
this
.
errorMessage
)
{
this
.
errorMessage
=
`Version check failed, didn't get version check response from trialKeeper, please check your NNI version in `
+
`NNIManager and TrialKeeper!`
if
(
this
.
enableVersionCheck
&&
(
this
.
versionCheckSuccess
===
undefined
||
!
this
.
versionCheckSuccess
)
&&
this
.
errorMessage
===
undefined
)
{
this
.
errorMessage
=
`Version check failed, didn't get version check response from trialKeeper,`
+
` please check your NNI version in NNIManager and TrialKeeper!`
;
}
const
trialLogPath
:
string
=
path
.
join
(
getLogDir
(),
`trial_
${
req
.
params
.
trialId
}
.log`
);
try
{
let
skipLogging
:
boolean
=
false
;
if
(
req
.
body
.
tag
===
'
trial
'
&&
req
.
body
.
msg
!==
undefined
)
{
const
metricsContent
=
req
.
body
.
msg
.
match
(
this
.
NNI_METRICS_PATTERN
);
if
(
metricsContent
&&
metricsContent
.
groups
)
{
this
.
handleTrialMetrics
(
req
.
params
.
trialId
,
[
metricsContent
.
groups
[
'
metrics
'
]]);
if
(
req
.
body
.
tag
===
'
trial
'
&&
req
.
body
.
msg
!==
undefined
)
{
const
metricsContent
:
any
=
req
.
body
.
msg
.
match
(
this
.
NNI_METRICS_PATTERN
);
if
(
metricsContent
&&
metricsContent
.
groups
)
{
const
key
:
string
=
'
metrics
'
;
this
.
handleTrialMetrics
(
req
.
params
.
trialId
,
[
metricsContent
.
groups
[
key
]]);
skipLogging
=
true
;
}
}
if
(
!
skipLogging
){
if
(
!
skipLogging
)
{
// Construct write stream to write remote trial's log into local file
// tslint:disable-next-line:non-literal-fs-path
const
writeStream
:
Writable
=
fs
.
createWriteStream
(
trialLogPath
,
{
flags
:
'
a+
'
,
encoding
:
'
utf8
'
,
autoClose
:
true
});
writeStream
.
write
(
req
.
body
.
msg
+
'
\n
'
);
writeStream
.
write
(
String
.
Format
(
'
{0}
\n
'
,
req
.
body
.
msg
)
);
writeStream
.
end
();
}
res
.
send
();
}
catch
(
err
)
{
}
catch
(
err
)
{
this
.
log
.
error
(
`json parse stdout data error:
${
err
}
`
);
res
.
status
(
500
);
res
.
send
(
err
.
message
);
...
...
@@ -168,7 +176,5 @@ export abstract class ClusterJobRestServer extends RestServer{
return
router
;
}
/** Abstract method to handle trial metrics data */
protected
abstract
handleTrialMetrics
(
jobId
:
string
,
trialMetrics
:
any
[])
:
void
;
}
\ No newline at end of file
// tslint:enable: no-unsafe-any no-any
}
src/nni_manager/training_service/common/containerJobData.ts
View file @
ba8dccd6
...
...
@@ -19,12 +19,12 @@
'
use strict
'
;
export
const
CONTAINER_INSTALL_NNI_SHELL_FORMAT
:
string
=
export
const
CONTAINER_INSTALL_NNI_SHELL_FORMAT
:
string
=
`#!/bin/bash
if python3 -c 'import nni' > /dev/null 2>&1; then
# nni module is already installed, skip
return
else
# Install nni
python3 -m pip install --user --upgrade nni
fi`
;
\ No newline at end of file
python3 -m pip install --user --upgrade nni
fi`
;
Prev
1
2
3
4
5
6
7
8
9
…
11
Next
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
.
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment