Commit a1f92666 authored by demianzhang, committed by xuehui

NNI on Windows for NNI Remote mode (#1073)

* test python

* test python36

* debug python

* debug python

* debug

* python version

* test python

* debug

* install nni

* install nni

* test powershell

* debug python

* test

* test python

* use python

* test python

* test python

* test

* update

* test powershell

* debug python

* debug python

* debug python

* debug powershell

* debug

* debug

* debug install.ps1

* add continueOnError: true

* debug

* debug

* update

* update

* add unittest

* test node

* update

* update joi

* debug joi

* add joi

* debug joi

* Update install

* update

* update

* add unittest

* add convert command

* add example

* fix windows commands

* debug

* fix tensorflow version

* fix pipeline

* update

* add gpu logic in windows

* update

* update

* debug

* fix commands

* fix commands

* update

* update

* Fix comments

* update

* fix kill command

* fix package.json

* Update package.json

* Refactor runScript

* Fix bug

* Fix comments

* Fix execKill

* Update

* Update

* Add unittest back

* Rollback install node

* Fix gpu memory

* Update

* Rollback check process

* Update mnist-hyperband.test.yml

* Update pipelines-it-local-windows.yml

* Update uninstall.ps1

* Fix virtual environment

* Fix tar

* Fix isAlive

* change gpu index logic

* test gpu index

* fix pipeline

* add cifar10

* fix cifar10

* remove gpu in cifar10

* test mnist gpu

* update

* debug

* Fix comments

* debug

* Update install.ps1

* debug

* update gpu metrics shell

* debug

* debug

* debug

* debug

* debug

* debug sigbreak

* Preinstall node-pre-gyp

* Update Installation.md

* Update Installation.md

* Remove install node-pre-gyp

* use taskkill to stop node process

* use ctl+c event to stop process

* add sigtrem signal in stop logic

* add ctl+break command

* Update isAlive

* debug sigterm

* Update pypi readme

* Update

* fix stop logic

* fix pipeline, add cifar10

* revert mnist, remove gpu

* Fix virtualenv

* Fix comments

* Update

* Update

* Fix install

* Update install.ps1

* Update install.ps1

* Fix comments

* Fix virtualenv install

* Update

* Update

* Fix comments

* Update

* Update install.ps1

* Update

* Update localTrainingService.ts

* Update

* Update

* Update

* Update

* Update

* Update util.ts

* Update utils.ts

* Fix system slash

* Update tmp dir

* Fix system slash

* Use python3 in remote

* Write tar command to file

* Update tar

* Update

* Update

* Fix stop

* Update StopSignal type

* Add removeTrialJobMetricListener

* remove Listeners

* Update listener

* Update

* Use Temp dir

* Use Temp dir

* Add remote windows pipeline

* Update pipelines-it-remote-windows.yml

* Update

* remote build wheel

* Update pipelines-it-remote-windows.yml

* debug

* debug

* Use docker source install

* Update

* Update

* Rollback remote build wheel

* Use self node and yarn

* Fix docker source install

* Rollback Makefile

* Upgrade docker pip

* Update

* Update

* Remote build wheel

* Use inline runOptions

* Hide wget output

* Add continueOnError

* Update

* Update

* Update

* Upgrade pip

* Add chmod

* Update

* debug

* Update

* Use pscp

* Update

* Download putty

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* Update

* debug

* exclude metis

* Refactor pathJoin

* Update

* debug metis

* debug metis

* Update

* Update dependency

* Fix comments

* Update

* Fix tslint

* Fix comments

* Fix comments

* add doc

* Fix comments

* Update

* Update doc
parent d8e1c4af
...@@ -106,7 +106,7 @@ We encourage researchers and students leverage these projects to accelerate the ...@@ -106,7 +106,7 @@ We encourage researchers and students leverage these projects to accelerate the
## **Install & Verify** ## **Install & Verify**
If you choose NNI Windows local mode and you use PowerShell to run script for the first time, you need to **run PowerShell as administrator** with this command first: If you are using NNI on Windows and use PowerShell to run script for the first time, you need to **run PowerShell as administrator** with this command first:
```bash ```bash
Set-ExecutionPolicy -ExecutionPolicy Unrestricted Set-ExecutionPolicy -ExecutionPolicy Unrestricted
...@@ -114,7 +114,7 @@ If you choose NNI Windows local mode and you use PowerShell to run script for th ...@@ -114,7 +114,7 @@ If you choose NNI Windows local mode and you use PowerShell to run script for th
**Install through pip** **Install through pip**
* We support Linux, MacOS and Windows(local mode) in current stage, Ubuntu 16.04 or higher, MacOS 10.14.1 along with Windows 10.1809 are tested and supported. Simply run the following `pip install` in an environment that has `python >= 3.5`. * We support Linux, MacOS and Windows(local, remote and pai mode) in current stage, Ubuntu 16.04 or higher, MacOS 10.14.1 along with Windows 10.1809 are tested and supported. Simply run the following `pip install` in an environment that has `python >= 3.5`.
Linux and MacOS Linux and MacOS
...@@ -131,12 +131,12 @@ python -m pip install --upgrade nni ...@@ -131,12 +131,12 @@ python -m pip install --upgrade nni
Note: Note:
* `--user` can be added if you want to install NNI in your home directory, which does not require any special privileges. * `--user` can be added if you want to install NNI in your home directory, which does not require any special privileges.
* Currently NNI on Windows only support local mode. Anaconda or Miniconda is highly recommended to install NNI on Windows. * Currently NNI on Windows support local, remote and pai mode. Anaconda or Miniconda is highly recommended to install NNI on Windows.
* If there is any error like `Segmentation fault`, please refer to [FAQ](docs/en_US/FAQ.md) * If there is any error like `Segmentation fault`, please refer to [FAQ](docs/en_US/FAQ.md)
**Install through source code** **Install through source code**
* We support Linux (Ubuntu 16.04 or higher), MacOS (10.14.1) and Windows local mode (10.1809) in our current stage. * We support Linux (Ubuntu 16.04 or higher), MacOS (10.14.1) and Windows (10.1809) in our current stage.
Linux and MacOS Linux and MacOS
...@@ -160,7 +160,7 @@ Windows ...@@ -160,7 +160,7 @@ Windows
For the system requirements of NNI, please refer to [Install NNI](docs/en_US/Installation.md) For the system requirements of NNI, please refer to [Install NNI](docs/en_US/Installation.md)
For NNI Windows local mode, please refer to [NNI Windows local mode](docs/en_US/WindowsLocalMode.md) For NNI on Windows, please refer to [NNI on Windows](docs/en_US/NniOnWindows.md)
**Verify install** **Verify install**
......
...@@ -20,22 +20,28 @@ ifeq ($(version_ts), true) ...@@ -20,22 +20,28 @@ ifeq ($(version_ts), true)
NNI_VERSION_VALUE := $(NNI_VERSION_VALUE).$(TIME_STAMP) NNI_VERSION_VALUE := $(NNI_VERSION_VALUE).$(TIME_STAMP)
endif endif
NNI_VERSION_TEMPLATE = 999.0.0-developing NNI_VERSION_TEMPLATE = 999.0.0-developing
NNI_YARN_TARBALL ?= $(CWD)nni-yarn.tar.gz
NNI_YARN_FOLDER ?= $(CWD)nni-yarn
NNI_YARN := PATH=$(CWD)node-$(OS_SPEC)-x64/bin:$${PATH} $(NNI_YARN_FOLDER)/bin/yarn
.PHONY: build .PHONY: build
build: build:
python3 -m pip install --user --upgrade setuptools wheel python3 -m pip install --user --upgrade setuptools wheel
wget https://aka.ms/nni/nodejs-download/$(OS_SPEC) -O $(CWD)node-$(OS_SPEC)-x64.tar.xz wget -q https://aka.ms/nni/nodejs-download/$(OS_SPEC) -O $(CWD)node-$(OS_SPEC)-x64.tar.xz
rm -rf $(CWD)node-$(OS_SPEC)-x64 rm -rf $(CWD)node-$(OS_SPEC)-x64
mkdir $(CWD)node-$(OS_SPEC)-x64 mkdir $(CWD)node-$(OS_SPEC)-x64
tar xf $(CWD)node-$(OS_SPEC)-x64.tar.xz -C node-$(OS_SPEC)-x64 --strip-components 1 tar xf $(CWD)node-$(OS_SPEC)-x64.tar.xz -C node-$(OS_SPEC)-x64 --strip-components 1
cd $(CWD)../../src/nni_manager && yarn && yarn build wget -q https://aka.ms/yarn-download -O $(NNI_YARN_TARBALL)
cd $(CWD)../../src/webui && yarn && yarn build rm -rf $(NNI_YARN_FOLDER)
mkdir $(NNI_YARN_FOLDER)
tar -xf $(NNI_YARN_TARBALL) -C $(NNI_YARN_FOLDER) --strip-components 1
cd $(CWD)../../src/nni_manager && $(NNI_YARN) && $(NNI_YARN) build
cd $(CWD)../../src/webui && $(NNI_YARN) && $(NNI_YARN) build
rm -rf $(CWD)nni rm -rf $(CWD)nni
cp -r $(CWD)../../src/nni_manager/dist $(CWD)nni cp -r $(CWD)../../src/nni_manager/dist $(CWD)nni
cp -r $(CWD)../../src/webui/build $(CWD)nni/static cp -r $(CWD)../../src/webui/build $(CWD)nni/static
cp $(CWD)../../src/nni_manager/package.json $(CWD)nni cp $(CWD)../../src/nni_manager/package.json $(CWD)nni
sed -ie 's/$(NNI_VERSION_TEMPLATE)/$(NNI_VERSION_VALUE)/' $(CWD)nni/package.json sed -ie 's/$(NNI_VERSION_TEMPLATE)/$(NNI_VERSION_VALUE)/' $(CWD)nni/package.json
cd $(CWD)nni && yarn --prod cd $(CWD)nni && $(NNI_YARN) --prod
cd $(CWD) && sed -ie 's/$(NNI_VERSION_TEMPLATE)/$(NNI_VERSION_VALUE)/' setup.py && python3 setup.py bdist_wheel -p $(WHEEL_SPEC) cd $(CWD) && sed -ie 's/$(NNI_VERSION_TEMPLATE)/$(NNI_VERSION_VALUE)/' setup.py && python3 setup.py bdist_wheel -p $(WHEEL_SPEC)
cd $(CWD) cd $(CWD)
...@@ -50,4 +56,4 @@ clean: ...@@ -50,4 +56,4 @@ clean:
rm -rf $(CWD)dist rm -rf $(CWD)dist
rm -rf $(CWD)nni rm -rf $(CWD)nni
rm -rf $(CWD)nni.egg-info rm -rf $(CWD)nni.egg-info
rm -rf $(CWD)node-$(OS_SPEC)-x64 rm -rf $(CWD)node-$(OS_SPEC)-x64
\ No newline at end of file
...@@ -36,8 +36,8 @@ Unable to open the WebUI may have the following reasons: ...@@ -36,8 +36,8 @@ Unable to open the WebUI may have the following reasons:
* If you still can't see the WebUI after you use the server IP, you can check the proxy and the firewall of your machine. Or use the browser on the machine where you start your NNI experiment. * If you still can't see the WebUI after you use the server IP, you can check the proxy and the firewall of your machine. Or use the browser on the machine where you start your NNI experiment.
* Another reason may be your experiment is failed and NNI may fail to get the experiment infomation. You can check the log of NNImanager in the following directory: ~/nni/experiment/[your_experiment_id] /log/nnimanager.log * Another reason may be your experiment is failed and NNI may fail to get the experiment infomation. You can check the log of NNImanager in the following directory: ~/nni/experiment/[your_experiment_id] /log/nnimanager.log
### Windows local mode problems ### NNI on Windows problems
Please refer to [NNI Windows local mode](WindowsLocalMode.md) Please refer to [NNI on Windows](NniOnWindows.md)
### Help us improve ### Help us improve
Please inquiry the problem in https://github.com/Microsoft/nni/issues to see whether there are other people already reported the problem, create a new one if there are no existing issues been created. Please inquiry the problem in https://github.com/Microsoft/nni/issues to see whether there are other people already reported the problem, create a new one if there are no existing issues been created.
# Installation of NNI # Installation of NNI
Currently we support installation on Linux, Mac and Windows(local mode). Currently we support installation on Linux, Mac and Windows(local, remote and pai mode).
## **Installation on Linux & Mac** ## **Installation on Linux & Mac**
......
# Windows Local Mode (experimental feature) # NNI on Windows (experimental feature)
Currently we only support local mode on Windows. Windows 10.1809 is well tested and recommended. Currently we support local, remote and pai mode on Windows. Windows 10.1809 is well tested and recommended.
## **Installation on Windows** ## **Installation on Windows**
...@@ -25,15 +25,15 @@ Set-ExecutionPolicy -ExecutionPolicy Unrestricted ...@@ -25,15 +25,15 @@ Set-ExecutionPolicy -ExecutionPolicy Unrestricted
Prerequisite: `python >=3.5`, `git`, `PowerShell` Prerequisite: `python >=3.5`, `git`, `PowerShell`
```bash ```bash
git clone -b v0.7 https://github.com/Microsoft/nni.git git clone -b v0.8 https://github.com/Microsoft/nni.git
cd nni cd nni
powershell ./install.ps1 powershell -file install.ps1
``` ```
When these things are done, use the **config_windows.yml** configuration to start an experiment for validation. When these things are done, use the **config_windows.yml** configuration to start an experiment for validation.
```bash ```bash
nnictl create --config nni/examples/trials/mnist/config_windows.yml nnictl create --config nni\examples\trials\mnist\config_windows.yml
``` ```
For other examples you need to change trial command `python3` into `python` in each example YAML. For other examples you need to change trial command `python3` into `python` in each example YAML.
......
...@@ -2,7 +2,7 @@ ...@@ -2,7 +2,7 @@
## Installation ## Installation
We support Linux MacOS and Windows(local mode) in current stage, Ubuntu 16.04 or higher, MacOS 10.14.1 and Windows 10.1809 are tested and supported. Simply run the following `pip install` in an environment that has `python >= 3.5`. We support Linux MacOS and Windows in current stage, Ubuntu 16.04 or higher, MacOS 10.14.1 and Windows 10.1809 are tested and supported. Simply run the following `pip install` in an environment that has `python >= 3.5`.
#### Linux and MacOS #### Linux and MacOS
```bash ```bash
...@@ -10,7 +10,7 @@ We support Linux MacOS and Windows(local mode) in current stage, Ubuntu 16.04 or ...@@ -10,7 +10,7 @@ We support Linux MacOS and Windows(local mode) in current stage, Ubuntu 16.04 or
``` ```
#### Windows #### Windows
If you choose Windows local mode and use PowerShell to run script, you need run below PowerShell command as administrator at first time. If you are using NNI on Windows, you need run below PowerShell command as administrator at first time.
```bash ```bash
Set-ExecutionPolicy -ExecutionPolicy Unrestricted Set-ExecutionPolicy -ExecutionPolicy Unrestricted
``` ```
...@@ -151,7 +151,7 @@ Run the **config.yml** file from your command line to start MNIST experiment. ...@@ -151,7 +151,7 @@ Run the **config.yml** file from your command line to start MNIST experiment.
#### Windows #### Windows
Run the **config_windows.yml** file from your command line to start MNIST experiment. Run the **config_windows.yml** file from your command line to start MNIST experiment.
**Note**, if you're using windows local mode, it needs to change `python3` to `python` in the config.yml file, or use the config_windows.yml file to start the experiment. **Note**, if you're using NNI on Windows, it needs to change `python3` to `python` in the config.yml file, or use the config_windows.yml file to start the experiment.
```bash ```bash
nnictl create --config nni/examples/trials/mnist/config_windows.yml nnictl create --config nni/examples/trials/mnist/config_windows.yml
......
...@@ -55,7 +55,8 @@ machineList: ...@@ -55,7 +55,8 @@ machineList:
username: bob username: bob
passwd: bob123 passwd: bob123
``` ```
You can use different systems to run experiments on the remote machine.
#### Linux and MacOS
Simply filling the `machineList` section and then run: Simply filling the `machineList` section and then run:
```bash ```bash
...@@ -64,5 +65,14 @@ nnictl create --config ~/nni/examples/trials/mnist-annotation/config_remote.yml ...@@ -64,5 +65,14 @@ nnictl create --config ~/nni/examples/trials/mnist-annotation/config_remote.yml
to start the experiment. to start the experiment.
#### Windows
Simply filling the `machineList` section and then run:
```bash
nnictl create --config %userprofile%\nni\examples\trials\mnist-annotation\config_remote.yml
```
to start the experiment.
## version check ## version check
NNI support version check feature in since version 0.6, [refer](PaiMode.md) NNI support version check feature in since version 0.6, [refer](PaiMode.md)
\ No newline at end of file
...@@ -15,7 +15,7 @@ $yarnUrl = "https://yarnpkg.com/latest.tar.gz" ...@@ -15,7 +15,7 @@ $yarnUrl = "https://yarnpkg.com/latest.tar.gz"
$unzipNodeDir = "node-v*" $unzipNodeDir = "node-v*"
$unzipYarnDir = "yarn-v*" $unzipYarnDir = "yarn-v*"
$NNI_DEPENDENCY_FOLDER = "C:\tmp\$env:USERNAME" $NNI_DEPENDENCY_FOLDER = [System.IO.Path]::GetTempPath()+$env:USERNAME
$WHICH_PYTHON = where.exe python $WHICH_PYTHON = where.exe python
if($WHICH_PYTHON -eq $null){ if($WHICH_PYTHON -eq $null){
......
...@@ -43,11 +43,11 @@ function getExperimentRootDir(): string { ...@@ -43,11 +43,11 @@ function getExperimentRootDir(): string {
.getLogDir(); .getLogDir();
} }
function getLogDir(): string{ function getLogDir(): string {
return path.join(getExperimentRootDir(), 'log'); return path.join(getExperimentRootDir(), 'log');
} }
function getLogLevel(): string{ function getLogLevel(): string {
return getExperimentStartupInfo() return getExperimentStartupInfo()
.getLogLevel(); .getLogLevel();
} }
...@@ -149,7 +149,7 @@ function parseArg(names: string[]): string { ...@@ -149,7 +149,7 @@ function parseArg(names: string[]): string {
return ''; return '';
} }
function encodeCmdLineArgs(args:any):any{ function encodeCmdLineArgs(args: any): any {
if(process.platform === 'win32'){ if(process.platform === 'win32'){
return JSON.stringify(args); return JSON.stringify(args);
} }
...@@ -158,7 +158,7 @@ function encodeCmdLineArgs(args:any):any{ ...@@ -158,7 +158,7 @@ function encodeCmdLineArgs(args:any):any{
} }
} }
function getCmdPy():string{ function getCmdPy(): string {
let cmd = 'python3'; let cmd = 'python3';
if(process.platform === 'win32'){ if(process.platform === 'win32'){
cmd = 'python'; cmd = 'python';
...@@ -390,7 +390,7 @@ async function getVersion(): Promise<string> { ...@@ -390,7 +390,7 @@ async function getVersion(): Promise<string> {
/** /**
* run command as ChildProcess * run command as ChildProcess
*/ */
function getTunerProc(command: string, stdio: StdioOptions, newCwd: string, newEnv: any): ChildProcess{ function getTunerProc(command: string, stdio: StdioOptions, newCwd: string, newEnv: any): ChildProcess {
let cmd: string = command; let cmd: string = command;
let arg: string[] = []; let arg: string[] = [];
let newShell: boolean = true; let newShell: boolean = true;
...@@ -411,7 +411,7 @@ function getTunerProc(command: string, stdio: StdioOptions, newCwd: string, newE ...@@ -411,7 +411,7 @@ function getTunerProc(command: string, stdio: StdioOptions, newCwd: string, newE
/** /**
* judge whether the process is alive * judge whether the process is alive
*/ */
async function isAlive(pid:any): Promise<boolean>{ async function isAlive(pid:any): Promise<boolean> {
let deferred : Deferred<boolean> = new Deferred<boolean>(); let deferred : Deferred<boolean> = new Deferred<boolean>();
let alive: boolean = false; let alive: boolean = false;
if(process.platform ==='win32'){ if(process.platform ==='win32'){
...@@ -439,7 +439,7 @@ async function isAlive(pid:any): Promise<boolean>{ ...@@ -439,7 +439,7 @@ async function isAlive(pid:any): Promise<boolean>{
/** /**
* kill process * kill process
*/ */
async function killPid(pid:any): Promise<void>{ async function killPid(pid:any): Promise<void> {
let deferred : Deferred<void> = new Deferred<void>(); let deferred : Deferred<void> = new Deferred<void>();
try { try {
if (process.platform === "win32") { if (process.platform === "win32") {
...@@ -455,7 +455,7 @@ async function killPid(pid:any): Promise<void>{ ...@@ -455,7 +455,7 @@ async function killPid(pid:any): Promise<void>{
return deferred.promise; return deferred.promise;
} }
function getNewLine(): string{ function getNewLine(): string {
if (process.platform === "win32") { if (process.platform === "win32") {
return "\r\n"; return "\r\n";
} }
......
...@@ -58,7 +58,8 @@ class NNIManager implements Manager { ...@@ -58,7 +58,8 @@ class NNIManager implements Manager {
private status: NNIManagerStatus; private status: NNIManagerStatus;
private waitingTrials: string[]; private waitingTrials: string[];
private trialJobs: Map<string, TrialJobDetail>; private trialJobs: Map<string, TrialJobDetail>;
private trialJobMetricListener: (metric: TrialJobMetric) => void;
constructor() { constructor() {
this.currSubmittedTrialNum = 0; this.currSubmittedTrialNum = 0;
this.trialConcurrencyChange = 0; this.trialConcurrencyChange = 0;
...@@ -76,6 +77,11 @@ class NNIManager implements Manager { ...@@ -76,6 +77,11 @@ class NNIManager implements Manager {
status: 'INITIALIZED', status: 'INITIALIZED',
errors: [] errors: []
}; };
this.trialJobMetricListener = (metric: TrialJobMetric) => {
this.onTrialJobMetrics(metric).catch((err: Error) => {
this.criticalError(NNIError.FromError(err, 'Job metrics error: '));
});
};
} }
public updateExperimentProfile(experimentProfile: ExperimentProfile, updateType: ProfileUpdateType): Promise<void> { public updateExperimentProfile(experimentProfile: ExperimentProfile, updateType: ProfileUpdateType): Promise<void> {
...@@ -342,6 +348,7 @@ class NNIManager implements Manager { ...@@ -342,6 +348,7 @@ class NNIManager implements Manager {
if (this.dispatcher === undefined) { if (this.dispatcher === undefined) {
throw new Error('Error: tuner has not been setup'); throw new Error('Error: tuner has not been setup');
} }
this.trainingService.removeTrialJobMetricListener(this.trialJobMetricListener);
this.dispatcher.sendCommand(TERMINATE); this.dispatcher.sendCommand(TERMINATE);
let tunerAlive: boolean = true; let tunerAlive: boolean = true;
// gracefully terminate tuner and assessor here, wait at most 30 seconds. // gracefully terminate tuner and assessor here, wait at most 30 seconds.
...@@ -589,11 +596,7 @@ class NNIManager implements Manager { ...@@ -589,11 +596,7 @@ class NNIManager implements Manager {
if (this.dispatcher === undefined) { if (this.dispatcher === undefined) {
throw new Error('Error: tuner or job maintainer have not been setup'); throw new Error('Error: tuner or job maintainer have not been setup');
} }
this.trainingService.addTrialJobMetricListener((metric: TrialJobMetric) => { this.trainingService.addTrialJobMetricListener(this.trialJobMetricListener);
this.onTrialJobMetrics(metric).catch((err: Error) => {
this.criticalError(NNIError.FromError(err, 'Job metrics error: '));
});
});
this.dispatcher.onCommand((commandType: string, content: string) => { this.dispatcher.onCommand((commandType: string, content: string) => {
this.onTunerCommand(commandType, content).catch((err: Error) => { this.onTunerCommand(commandType, content).catch((err: Error) => {
......
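Note on the listener change above: the metric callback now lives in the `trialJobMetricListener` field so that the same function reference can be passed to both `addTrialJobMetricListener` and `removeTrialJobMetricListener`. The sketch below illustrates why that matters, assuming the training service is backed by a Node.js `EventEmitter` (an assumption; the service internals are not shown in this diff). `MetricSource` and the string metric type are hypothetical stand-ins.

```typescript
import { EventEmitter } from 'events';

// Hypothetical stand-in for a training service that emits trial metrics.
class MetricSource {
    private readonly emitter: EventEmitter = new EventEmitter();

    public addTrialJobMetricListener(listener: (metric: string) => void): void {
        this.emitter.on('metric', listener);
    }

    public removeTrialJobMetricListener(listener: (metric: string) => void): void {
        // off() matches listeners by function identity, so the caller
        // must pass the exact reference that was registered.
        this.emitter.off('metric', listener);
    }

    public emit(metric: string): void {
        this.emitter.emit('metric', metric);
    }
}

const received: string[] = [];
const source: MetricSource = new MetricSource();

// Keeping the callback in a variable (a field, in the diff) makes removal possible.
const listener = (metric: string): void => { received.push(metric); };

source.addTrialJobMetricListener(listener);
source.emit('first');                           // delivered
source.removeTrialJobMetricListener(listener);
source.emit('second');                          // not delivered: listener was removed
```

With an inline arrow function, as in the old code, there is no reference left to hand to the remove call, so the listener would stay attached during shutdown.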
...@@ -24,7 +24,10 @@ import { getLogger } from "common/log"; ...@@ -24,7 +24,10 @@ import { getLogger } from "common/log";
import { countFilesRecursively } from '../../common/utils' import { countFilesRecursively } from '../../common/utils'
import * as cpp from 'child-process-promise'; import * as cpp from 'child-process-promise';
import * as cp from 'child_process'; import * as cp from 'child_process';
import { GPU_INFO_COLLECTOR_FORMAT_LINUX, GPU_INFO_COLLECTOR_FORMAT_WINDOWS } from './gpuData' import * as os from 'os';
import * as fs from 'fs';
import { getNewLine } from '../../common/utils';
import { GPU_INFO_COLLECTOR_FORMAT_LINUX, GPU_INFO_COLLECTOR_FORMAT_WINDOWS } from './gpuData';
import * as path from 'path'; import * as path from 'path';
import { String } from 'typescript-string-operations'; import { String } from 'typescript-string-operations';
import { file } from "../../node_modules/@types/tmp"; import { file } from "../../node_modules/@types/tmp";
...@@ -66,6 +69,20 @@ export async function execMkdir(directory: string): Promise<void> { ...@@ -66,6 +69,20 @@ export async function execMkdir(directory: string): Promise<void> {
return Promise.resolve(); return Promise.resolve();
} }
/**
* copy files to the directory
* @param source
* @param destination
*/
export async function execCopydir(source: string, destination: string): Promise<void> {
if (process.platform === 'win32') {
await cpp.exec(`powershell.exe Copy-Item ${source} -Destination ${destination} -Recurse`);
} else {
await cpp.exec(`cp -r ${source} ${destination}`);
}
return Promise.resolve();
}
/** /**
* crete a new file * crete a new file
* @param filename * @param filename
...@@ -91,8 +108,6 @@ export function execScript(filePath: string): cp.ChildProcess { ...@@ -91,8 +108,6 @@ export function execScript(filePath: string): cp.ChildProcess {
} }
} }
/** /**
* output the last line of a file * output the last line of a file
* @param filePath * @param filePath
...@@ -111,9 +126,9 @@ export async function execTail(filePath: string): Promise<cpp.childProcessPromis ...@@ -111,9 +126,9 @@ export async function execTail(filePath: string): Promise<cpp.childProcessPromis
* delete a directory * delete a directory
* @param directory * @param directory
*/ */
export async function execRemove(directory: string): Promise<void>{ export async function execRemove(directory: string): Promise<void> {
if (process.platform === 'win32') { if (process.platform === 'win32') {
await cpp.exec(`powershell.exe Remove-Item ${directory}`); await cpp.exec(`powershell.exe Remove-Item ${directory} -Recurse -Force`);
} else { } else {
await cpp.exec(`rm -rf ${directory}`); await cpp.exec(`rm -rf ${directory}`);
} }
...@@ -124,7 +139,7 @@ export async function execRemove(directory: string): Promise<void>{ ...@@ -124,7 +139,7 @@ export async function execRemove(directory: string): Promise<void>{
* kill a process * kill a process
* @param directory * @param directory
*/ */
export async function execKill(pid: string): Promise<void>{ export async function execKill(pid: string): Promise<void> {
if (process.platform === 'win32') { if (process.platform === 'win32') {
await cpp.exec(`cmd /c taskkill /PID ${pid} /T /F`); await cpp.exec(`cmd /c taskkill /PID ${pid} /T /F`);
} else { } else {
...@@ -138,7 +153,7 @@ export async function execKill(pid: string): Promise<void>{ ...@@ -138,7 +153,7 @@ export async function execKill(pid: string): Promise<void>{
* @param variable * @param variable
* @returns command string * @returns command string
*/ */
export function setEnvironmentVariable(variable: { key: string; value: string }): string{ export function setEnvironmentVariable(variable: { key: string; value: string }): string {
if (process.platform === 'win32') { if (process.platform === 'win32') {
return `$env:${variable.key}="${variable.value}"`; return `$env:${variable.key}="${variable.value}"`;
} }
...@@ -147,6 +162,32 @@ export function setEnvironmentVariable(variable: { key: string; value: string }) ...@@ -147,6 +162,32 @@ export function setEnvironmentVariable(variable: { key: string; value: string })
} }
} }
/**
* Compress files in directory to tar file
* @param source_path
* @param tar_path
*/
export async function tarAdd(tar_path: string, source_path: string): Promise<void> {
if (process.platform === 'win32') {
tar_path = tar_path.split('\\').join('\\\\');
source_path = source_path.split('\\').join('\\\\');
let script: string[] = [];
script.push(
`import os`,
`import tarfile`,
String.Format(`tar = tarfile.open("{0}","w:gz")\r\nfor root,dir,files in os.walk("{1}"):`, tar_path, source_path),
` for file in files:`,
` fullpath = os.path.join(root,file)`,
` tar.add(fullpath, arcname=file)`,
`tar.close()`);
await fs.promises.writeFile(path.join(os.tmpdir(), 'tar.py'), script.join(getNewLine()), { encoding: 'utf8', mode: 0o777 });
const tarScript: string = path.join(os.tmpdir(), 'tar.py');
await cpp.exec(`python ${tarScript}`);
} else {
await cpp.exec(`tar -czf ${tar_path} -C ${source_path} .`);
}
return Promise.resolve();
}
/** /**
* generate script file name * generate script file name
......
...@@ -36,7 +36,7 @@ import { ObservableTimer } from '../../common/observableTimer'; ...@@ -36,7 +36,7 @@ import { ObservableTimer } from '../../common/observableTimer';
import { import {
HostJobApplicationForm, HyperParameters, JobApplicationForm, TrainingService, TrialJobApplicationForm, TrialJobDetail, TrialJobMetric, NNIManagerIpConfig HostJobApplicationForm, HyperParameters, JobApplicationForm, TrainingService, TrialJobApplicationForm, TrialJobDetail, TrialJobMetric, NNIManagerIpConfig
} from '../../common/trainingService'; } from '../../common/trainingService';
import { delay, generateParamFileName, getExperimentRootDir, uniqueString, getJobCancelStatus, getRemoteTmpDir,getIPV4Address } from '../../common/utils'; import { delay, generateParamFileName, getExperimentRootDir, uniqueString, getJobCancelStatus, getRemoteTmpDir,getIPV4Address, getVersion, unixPathJoin } from '../../common/utils';
import { GPUSummary } from '../common/gpuData'; import { GPUSummary } from '../common/gpuData';
import { TrialConfig } from '../common/trialConfig'; import { TrialConfig } from '../common/trialConfig';
import { TrialConfigMetadataKey } from '../common/trialConfigMetadataKey'; import { TrialConfigMetadataKey } from '../common/trialConfigMetadataKey';
...@@ -48,10 +48,9 @@ import { ...@@ -48,10 +48,9 @@ import {
} from './remoteMachineData'; } from './remoteMachineData';
import { GPU_INFO_COLLECTOR_FORMAT_LINUX } from '../common/gpuData'; import { GPU_INFO_COLLECTOR_FORMAT_LINUX } from '../common/gpuData';
import { SSHClientUtility } from './sshClientUtility'; import { SSHClientUtility } from './sshClientUtility';
import { validateCodeDir } from '../common/util'; import { validateCodeDir, execRemove, execMkdir, execCopydir } from '../common/util';
import { RemoteMachineJobRestServer } from './remoteMachineJobRestServer'; import { RemoteMachineJobRestServer } from './remoteMachineJobRestServer';
import { CONTAINER_INSTALL_NNI_SHELL_FORMAT } from '../common/containerJobData'; import { CONTAINER_INSTALL_NNI_SHELL_FORMAT } from '../common/containerJobData';
import { mkDirP, getVersion } from '../../common/utils';
/** /**
* Training Service implementation for Remote Machine (Linux) * Training Service implementation for Remote Machine (Linux)
...@@ -234,7 +233,7 @@ class RemoteMachineTrainingService implements TrainingService { ...@@ -234,7 +233,7 @@ class RemoteMachineTrainingService implements TrainingService {
} else if (form.jobType === 'TRIAL') { } else if (form.jobType === 'TRIAL') {
// Generate trial job id(random) // Generate trial job id(random)
const trialJobId: string = uniqueString(5); const trialJobId: string = uniqueString(5);
const trialWorkingFolder: string = path.join(this.remoteExpRootDir, 'trials', trialJobId); const trialWorkingFolder: string = unixPathJoin(this.remoteExpRootDir, 'trials', trialJobId);
const trialJobDetail: RemoteMachineTrialJobDetail = new RemoteMachineTrialJobDetail( const trialJobDetail: RemoteMachineTrialJobDetail = new RemoteMachineTrialJobDetail(
trialJobId, trialJobId,
...@@ -354,7 +353,7 @@ class RemoteMachineTrainingService implements TrainingService { ...@@ -354,7 +353,7 @@ class RemoteMachineTrainingService implements TrainingService {
case TrialConfigMetadataKey.MACHINE_LIST: case TrialConfigMetadataKey.MACHINE_LIST:
await this.setupConnections(value); await this.setupConnections(value);
//remove local temp files //remove local temp files
await cpp.exec(`rm -rf ${this.getLocalGpuMetricCollectorDir()}`); await execRemove(this.getLocalGpuMetricCollectorDir());
break; break;
case TrialConfigMetadataKey.TRIAL_CONFIG: case TrialConfigMetadataKey.TRIAL_CONFIG:
const remoteMachineTrailConfig: TrialConfig = <TrialConfig>JSON.parse(value); const remoteMachineTrailConfig: TrialConfig = <TrialConfig>JSON.parse(value);
...@@ -417,7 +416,7 @@ class RemoteMachineTrainingService implements TrainingService { ...@@ -417,7 +416,7 @@ class RemoteMachineTrainingService implements TrainingService {
private async cleanupConnections(): Promise<void> { private async cleanupConnections(): Promise<void> {
try{ try{
for (const [rmMeta, sshClientManager] of this.machineSSHClientMap.entries()) { for (const [rmMeta, sshClientManager] of this.machineSSHClientMap.entries()) {
let jobpidPath: string = path.join(this.getRemoteScriptsPath(rmMeta.username), 'pid'); let jobpidPath: string = unixPathJoin(this.getRemoteScriptsPath(rmMeta.username), 'pid');
let client: Client | undefined = sshClientManager.getFirstSSHClient(); let client: Client | undefined = sshClientManager.getFirstSSHClient();
if(client) { if(client) {
await SSHClientUtility.remoteExeCommand(`pkill -P \`cat ${jobpidPath}\``, client); await SSHClientUtility.remoteExeCommand(`pkill -P \`cat ${jobpidPath}\``, client);
...@@ -438,7 +437,7 @@ class RemoteMachineTrainingService implements TrainingService { ...@@ -438,7 +437,7 @@ class RemoteMachineTrainingService implements TrainingService {
*/ */
private getLocalGpuMetricCollectorDir(): string { private getLocalGpuMetricCollectorDir(): string {
let userName: string = path.basename(os.homedir()); //get current user name of os let userName: string = path.basename(os.homedir()); //get current user name of os
return `${os.tmpdir()}/${userName}/nni/scripts/`; return path.join(os.tmpdir(), userName, 'nni', 'scripts');
} }
/** /**
...@@ -447,14 +446,14 @@ class RemoteMachineTrainingService implements TrainingService { ...@@ -447,14 +446,14 @@ class RemoteMachineTrainingService implements TrainingService {
*/ */
private async generateGpuMetricsCollectorScript(userName: string): Promise<void> { private async generateGpuMetricsCollectorScript(userName: string): Promise<void> {
let gpuMetricCollectorScriptFolder : string = this.getLocalGpuMetricCollectorDir(); let gpuMetricCollectorScriptFolder : string = this.getLocalGpuMetricCollectorDir();
await cpp.exec(`mkdir -p ${path.join(gpuMetricCollectorScriptFolder, userName)}`); await execMkdir(path.join(gpuMetricCollectorScriptFolder, userName));
//generate gpu_metrics_collector.sh //generate gpu_metrics_collector.sh
let gpuMetricsCollectorScriptPath: string = path.join(gpuMetricCollectorScriptFolder, userName, 'gpu_metrics_collector.sh'); let gpuMetricsCollectorScriptPath: string = path.join(gpuMetricCollectorScriptFolder, userName, 'gpu_metrics_collector.sh');
const remoteGPUScriptsDir: string = this.getRemoteScriptsPath(userName); // This directory is used to store gpu_metrics and pid created by script const remoteGPUScriptsDir: string = this.getRemoteScriptsPath(userName); // This directory is used to store gpu_metrics and pid created by script
const gpuMetricsCollectorScriptContent: string = String.Format( const gpuMetricsCollectorScriptContent: string = String.Format(
GPU_INFO_COLLECTOR_FORMAT_LINUX, GPU_INFO_COLLECTOR_FORMAT_LINUX,
remoteGPUScriptsDir, remoteGPUScriptsDir,
path.join(remoteGPUScriptsDir, 'pid'), unixPathJoin(remoteGPUScriptsDir, 'pid'),
); );
await fs.promises.writeFile(gpuMetricsCollectorScriptPath, gpuMetricsCollectorScriptContent, { encoding: 'utf8' }); await fs.promises.writeFile(gpuMetricsCollectorScriptPath, gpuMetricsCollectorScriptContent, { encoding: 'utf8' });
} }
...@@ -481,7 +480,7 @@ class RemoteMachineTrainingService implements TrainingService { ...@@ -481,7 +480,7 @@ class RemoteMachineTrainingService implements TrainingService {
private async initRemoteMachineOnConnected(rmMeta: RemoteMachineMeta, conn: Client): Promise<void> { private async initRemoteMachineOnConnected(rmMeta: RemoteMachineMeta, conn: Client): Promise<void> {
// Create root working directory after ssh connection is ready // Create root working directory after ssh connection is ready
await this.generateGpuMetricsCollectorScript(rmMeta.username); //generate gpu script in local machine first, will copy to remote machine later await this.generateGpuMetricsCollectorScript(rmMeta.username); //generate gpu script in local machine first, will copy to remote machine later
const nniRootDir: string = `${os.tmpdir()}/nni`; const nniRootDir: string = unixPathJoin(getRemoteTmpDir(this.remoteOS), 'nni');
await SSHClientUtility.remoteExeCommand(`mkdir -p ${this.remoteExpRootDir}`, conn); await SSHClientUtility.remoteExeCommand(`mkdir -p ${this.remoteExpRootDir}`, conn);
// Copy NNI scripts to remote experiment working directory // Copy NNI scripts to remote experiment working directory
...@@ -490,15 +489,15 @@ class RemoteMachineTrainingService implements TrainingService { ...@@ -490,15 +489,15 @@ class RemoteMachineTrainingService implements TrainingService {
await SSHClientUtility.remoteExeCommand(`mkdir -p ${remoteGpuScriptCollectorDir}`, conn); await SSHClientUtility.remoteExeCommand(`mkdir -p ${remoteGpuScriptCollectorDir}`, conn);
await SSHClientUtility.remoteExeCommand(`chmod 777 ${nniRootDir} ${nniRootDir}/* ${nniRootDir}/scripts/*`, conn); await SSHClientUtility.remoteExeCommand(`chmod 777 ${nniRootDir} ${nniRootDir}/* ${nniRootDir}/scripts/*`, conn);
//copy gpu_metrics_collector.sh to remote //copy gpu_metrics_collector.sh to remote
await SSHClientUtility.copyFileToRemote(path.join(localGpuScriptCollectorDir, rmMeta.username, 'gpu_metrics_collector.sh'), path.join(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh'), conn); await SSHClientUtility.copyFileToRemote(path.join(localGpuScriptCollectorDir, rmMeta.username, 'gpu_metrics_collector.sh'), unixPathJoin(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh'), conn);
//Begin to execute gpu_metrics_collection scripts //Begin to execute gpu_metrics_collection scripts
SSHClientUtility.remoteExeCommand(`bash ${path.join(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh')}`, conn); SSHClientUtility.remoteExeCommand(`bash ${unixPathJoin(remoteGpuScriptCollectorDir, 'gpu_metrics_collector.sh')}`, conn);
this.timer.subscribe( this.timer.subscribe(
async (tick: number) => { async (tick: number) => {
const cmdresult: RemoteCommandResult = await SSHClientUtility.remoteExeCommand( const cmdresult: RemoteCommandResult = await SSHClientUtility.remoteExeCommand(
`tail -n 1 ${path.join(remoteGpuScriptCollectorDir, 'gpu_metrics')}`, conn); `tail -n 1 ${unixPathJoin(remoteGpuScriptCollectorDir, 'gpu_metrics')}`, conn);
if (cmdresult && cmdresult.stdout) { if (cmdresult && cmdresult.stdout) {
rmMeta.gpuSummary = <GPUSummary>JSON.parse(cmdresult.stdout); rmMeta.gpuSummary = <GPUSummary>JSON.parse(cmdresult.stdout);
} }
...@@ -531,7 +530,7 @@ class RemoteMachineTrainingService implements TrainingService { ...@@ -531,7 +530,7 @@ class RemoteMachineTrainingService implements TrainingService {
} else if (rmScheduleResult.resultType === ScheduleResultType.SUCCEED } else if (rmScheduleResult.resultType === ScheduleResultType.SUCCEED
&& rmScheduleResult.scheduleInfo !== undefined) { && rmScheduleResult.scheduleInfo !== undefined) {
const rmScheduleInfo : RemoteMachineScheduleInfo = rmScheduleResult.scheduleInfo; const rmScheduleInfo : RemoteMachineScheduleInfo = rmScheduleResult.scheduleInfo;
const trialWorkingFolder: string = path.join(this.remoteExpRootDir, 'trials', trialJobId); const trialWorkingFolder: string = unixPathJoin(this.remoteExpRootDir, 'trials', trialJobId);
trialJobDetail.rmMeta = rmScheduleInfo.rmMeta; trialJobDetail.rmMeta = rmScheduleInfo.rmMeta;
...@@ -575,7 +574,7 @@ class RemoteMachineTrainingService implements TrainingService { ...@@ -575,7 +574,7 @@ class RemoteMachineTrainingService implements TrainingService {
const trialLocalTempFolder: string = path.join(this.expRootDir, 'trials-local', trialJobId); const trialLocalTempFolder: string = path.join(this.expRootDir, 'trials-local', trialJobId);
await SSHClientUtility.remoteExeCommand(`mkdir -p ${trialWorkingFolder}`, sshClient); await SSHClientUtility.remoteExeCommand(`mkdir -p ${trialWorkingFolder}`, sshClient);
await SSHClientUtility.remoteExeCommand(`mkdir -p ${path.join(trialWorkingFolder, '.nni')}`, sshClient); await SSHClientUtility.remoteExeCommand(`mkdir -p ${unixPathJoin(trialWorkingFolder, '.nni')}`, sshClient);
// RemoteMachineRunShellFormat is the run shell format string, // RemoteMachineRunShellFormat is the run shell format string,
// See definition in remoteMachineData.ts // See definition in remoteMachineData.ts
...@@ -603,20 +602,20 @@ class RemoteMachineTrainingService implements TrainingService { ...@@ -603,20 +602,20 @@ class RemoteMachineTrainingService implements TrainingService {
getExperimentId(), getExperimentId(),
trialJobDetail.sequenceId.toString(), trialJobDetail.sequenceId.toString(),
this.isMultiPhase, this.isMultiPhase,
path.join(trialWorkingFolder, '.nni', 'jobpid'), unixPathJoin(trialWorkingFolder, '.nni', 'jobpid'),
command, command,
nniManagerIp, nniManagerIp,
this.remoteRestServerPort, this.remoteRestServerPort,
version, version,
this.logCollection, this.logCollection,
path.join(trialWorkingFolder, '.nni', 'code') unixPathJoin(trialWorkingFolder, '.nni', 'code')
) )
//create tmp trial working folder locally. //create tmp trial working folder locally.
await cpp.exec(`mkdir -p ${path.join(trialLocalTempFolder, '.nni')}`); await execMkdir(path.join(trialLocalTempFolder, '.nni'));
//create tmp trial working folder locally. //create tmp trial working folder locally.
await cpp.exec(`cp -r ${this.trialConfig.codeDir}/* ${trialLocalTempFolder}`); await execCopydir(path.join(this.trialConfig.codeDir, '*'), trialLocalTempFolder);
const installScriptContent : string = CONTAINER_INSTALL_NNI_SHELL_FORMAT; const installScriptContent : string = CONTAINER_INSTALL_NNI_SHELL_FORMAT;
// Write NNI installation file to local tmp files // Write NNI installation file to local tmp files
await fs.promises.writeFile(path.join(trialLocalTempFolder, 'install_nni.sh'), installScriptContent, { encoding: 'utf8' }); await fs.promises.writeFile(path.join(trialLocalTempFolder, 'install_nni.sh'), installScriptContent, { encoding: 'utf8' });
...@@ -626,7 +625,7 @@ class RemoteMachineTrainingService implements TrainingService { ...@@ -626,7 +625,7 @@ class RemoteMachineTrainingService implements TrainingService {
// Copy files in codeDir to remote working directory // Copy files in codeDir to remote working directory
await SSHClientUtility.copyDirectoryToRemote(trialLocalTempFolder, trialWorkingFolder, sshClient, this.remoteOS); await SSHClientUtility.copyDirectoryToRemote(trialLocalTempFolder, trialWorkingFolder, sshClient, this.remoteOS);
// Execute command in remote machine // Execute command in remote machine
SSHClientUtility.remoteExeCommand(`bash ${path.join(trialWorkingFolder, 'run.sh')}`, sshClient); SSHClientUtility.remoteExeCommand(`bash ${unixPathJoin(trialWorkingFolder, 'run.sh')}`, sshClient);
} }
private async runHostJob(form: HostJobApplicationForm): Promise<TrialJobDetail> { private async runHostJob(form: HostJobApplicationForm): Promise<TrialJobDetail> {
...@@ -646,8 +645,8 @@ class RemoteMachineTrainingService implements TrainingService { ...@@ -646,8 +645,8 @@ class RemoteMachineTrainingService implements TrainingService {
); );
await fs.promises.writeFile(path.join(localDir, 'run.sh'), runScriptContent, { encoding: 'utf8' }); await fs.promises.writeFile(path.join(localDir, 'run.sh'), runScriptContent, { encoding: 'utf8' });
await SSHClientUtility.copyFileToRemote( await SSHClientUtility.copyFileToRemote(
path.join(localDir, 'run.sh'), path.join(remoteDir, 'run.sh'), sshClient); path.join(localDir, 'run.sh'), unixPathJoin(remoteDir, 'run.sh'), sshClient);
SSHClientUtility.remoteExeCommand(`bash ${path.join(remoteDir, 'run.sh')}`, sshClient); SSHClientUtility.remoteExeCommand(`bash ${unixPathJoin(remoteDir, 'run.sh')}`, sshClient);
const jobDetail: RemoteMachineTrialJobDetail = new RemoteMachineTrialJobDetail( const jobDetail: RemoteMachineTrialJobDetail = new RemoteMachineTrialJobDetail(
jobId, 'RUNNING', Date.now(), remoteDir, form, this.generateSequenceId() jobId, 'RUNNING', Date.now(), remoteDir, form, this.generateSequenceId()
...@@ -672,7 +671,7 @@ class RemoteMachineTrainingService implements TrainingService { ...@@ -672,7 +671,7 @@ class RemoteMachineTrainingService implements TrainingService {
private async updateTrialJobStatus(trialJob: RemoteMachineTrialJobDetail, sshClient: Client): Promise<TrialJobDetail> { private async updateTrialJobStatus(trialJob: RemoteMachineTrialJobDetail, sshClient: Client): Promise<TrialJobDetail> {
const deferred: Deferred<TrialJobDetail> = new Deferred<TrialJobDetail>(); const deferred: Deferred<TrialJobDetail> = new Deferred<TrialJobDetail>();
const jobpidPath: string = this.getJobPidPath(trialJob.id); const jobpidPath: string = this.getJobPidPath(trialJob.id);
const trialReturnCodeFilePath: string = path.join(this.remoteExpRootDir, 'trials', trialJob.id, '.nni', 'code'); const trialReturnCodeFilePath: string = unixPathJoin(this.remoteExpRootDir, 'trials', trialJob.id, '.nni', 'code');
try { try {
const killResult: number = (await SSHClientUtility.remoteExeCommand(`kill -0 \`cat ${jobpidPath}\``, sshClient)).exitCode; const killResult: number = (await SSHClientUtility.remoteExeCommand(`kill -0 \`cat ${jobpidPath}\``, sshClient)).exitCode;
// if the process of jobpid is not alive any more // if the process of jobpid is not alive any more
...@@ -712,15 +711,15 @@ class RemoteMachineTrainingService implements TrainingService { ...@@ -712,15 +711,15 @@ class RemoteMachineTrainingService implements TrainingService {
} }
private getRemoteScriptsPath(userName: string): string { private getRemoteScriptsPath(userName: string): string {
return path.join(getRemoteTmpDir(this.remoteOS), userName, 'nni', 'scripts'); return unixPathJoin(getRemoteTmpDir(this.remoteOS), userName, 'nni', 'scripts');
} }
private getHostJobRemoteDir(jobId: string): string { private getHostJobRemoteDir(jobId: string): string {
return path.join(this.remoteExpRootDir, 'hostjobs', jobId); return unixPathJoin(this.remoteExpRootDir, 'hostjobs', jobId);
} }
private getRemoteExperimentRootDir(): string{ private getRemoteExperimentRootDir(): string{
return path.join(getRemoteTmpDir(this.remoteOS), 'nni', 'experiments', getExperimentId()); return unixPathJoin(getRemoteTmpDir(this.remoteOS), 'nni', 'experiments', getExperimentId());
} }
public get MetricsEmitter() : EventEmitter { public get MetricsEmitter() : EventEmitter {
...@@ -735,9 +734,9 @@ class RemoteMachineTrainingService implements TrainingService { ...@@ -735,9 +734,9 @@ class RemoteMachineTrainingService implements TrainingService {
let jobpidPath: string; let jobpidPath: string;
if (trialJobDetail.form.jobType === 'TRIAL') { if (trialJobDetail.form.jobType === 'TRIAL') {
jobpidPath = path.join(trialJobDetail.workingDirectory, '.nni', 'jobpid'); jobpidPath = unixPathJoin(trialJobDetail.workingDirectory, '.nni', 'jobpid');
} else if (trialJobDetail.form.jobType === 'HOST') { } else if (trialJobDetail.form.jobType === 'HOST') {
jobpidPath = path.join(this.getHostJobRemoteDir(jobId), 'jobpid'); jobpidPath = unixPathJoin(this.getHostJobRemoteDir(jobId), 'jobpid');
} else { } else {
throw new Error(`Job type not supported: ${trialJobDetail.form.jobType}`); throw new Error(`Job type not supported: ${trialJobDetail.form.jobType}`);
} }
...@@ -751,14 +750,14 @@ class RemoteMachineTrainingService implements TrainingService { ...@@ -751,14 +750,14 @@ class RemoteMachineTrainingService implements TrainingService {
throw new Error('sshClient is undefined.'); throw new Error('sshClient is undefined.');
} }
const trialWorkingFolder: string = path.join(this.remoteExpRootDir, 'trials', trialJobId); const trialWorkingFolder: string = unixPathJoin(this.remoteExpRootDir, 'trials', trialJobId);
const trialLocalTempFolder: string = path.join(this.expRootDir, 'trials-local', trialJobId); const trialLocalTempFolder: string = path.join(this.expRootDir, 'trials-local', trialJobId);
const fileName: string = generateParamFileName(hyperParameters); const fileName: string = generateParamFileName(hyperParameters);
const localFilepath: string = path.join(trialLocalTempFolder, fileName); const localFilepath: string = path.join(trialLocalTempFolder, fileName);
await fs.promises.writeFile(localFilepath, hyperParameters.value, { encoding: 'utf8' }); await fs.promises.writeFile(localFilepath, hyperParameters.value, { encoding: 'utf8' });
await SSHClientUtility.copyFileToRemote(localFilepath, path.join(trialWorkingFolder, fileName), sshClient); await SSHClientUtility.copyFileToRemote(localFilepath, unixPathJoin(trialWorkingFolder, fileName), sshClient);
} }
private generateSequenceId(): number { private generateSequenceId(): number {
......
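Note on the `path.join` to `unixPathJoin` changes above: when nniManager runs on Windows, `path.join` produces backslash-separated paths, which are wrong for commands executed on a Linux remote machine over SSH. The sketch below only illustrates that difference. `unixPathJoinSketch` is a hypothetical stand-in and is not the actual `unixPathJoin` from `common/utils`, whose implementation is not shown in this diff. The sample path segments are made up.

```typescript
import * as path from 'path';

// Hypothetical stand-in for the unixPathJoin helper imported in this diff.
// Assumption: the real helper produces '/'-separated paths; its actual
// implementation in common/utils is not shown here.
function unixPathJoinSketch(...segments: string[]): string {
    // path.posix.join always uses '/' regardless of the host OS.
    return path.posix.join(...segments);
}

// path.win32.join reproduces what path.join would return on a Windows host,
// so the difference can be observed on any platform.
const windowsStyle: string = path.win32.join('/tmp', 'nni', 'experiments', 'abc', 'trials', 'xyz');
const remoteStyle: string = unixPathJoinSketch('/tmp', 'nni', 'experiments', 'abc', 'trials', 'xyz');

console.log(windowsStyle); // backslash-separated, unusable in a remote bash command
console.log(remoteStyle);  // slash-separated, safe to embed in `mkdir -p ...` on the remote
```

Local paths (for example `trialLocalTempFolder`) still go through `path.join` in the diff, since they are resolved on the machine running nniManager.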
...@@ -28,8 +28,9 @@ import * as stream from 'stream'; ...@@ -28,8 +28,9 @@ import * as stream from 'stream';
import { Deferred } from 'ts-deferred'; import { Deferred } from 'ts-deferred';
import { NNIError, NNIErrorNames } from '../../common/errors'; import { NNIError, NNIErrorNames } from '../../common/errors';
import { getLogger, Logger } from '../../common/log'; import { getLogger, Logger } from '../../common/log';
import { uniqueString, getRemoteTmpDir } from '../../common/utils'; import { uniqueString, getRemoteTmpDir, unixPathJoin } from '../../common/utils';
import { RemoteCommandResult } from './remoteMachineData'; import { RemoteCommandResult } from './remoteMachineData';
import { execRemove, tarAdd } from '../common/util';
/** /**
* *
...@@ -47,13 +48,13 @@ export namespace SSHClientUtility { ...@@ -47,13 +48,13 @@ export namespace SSHClientUtility {
const deferred: Deferred<void> = new Deferred<void>(); const deferred: Deferred<void> = new Deferred<void>();
const tmpTarName: string = `${uniqueString(10)}.tar.gz`; const tmpTarName: string = `${uniqueString(10)}.tar.gz`;
const localTarPath: string = path.join(os.tmpdir(), tmpTarName); const localTarPath: string = path.join(os.tmpdir(), tmpTarName);
const remoteTarPath: string = path.join(getRemoteTmpDir(remoteOS), tmpTarName); const remoteTarPath: string = unixPathJoin(getRemoteTmpDir(remoteOS), tmpTarName);
// Compress files in local directory to experiment root directory // Compress files in local directory to experiment root directory
await cpp.exec(`tar -czf ${localTarPath} -C ${localDirectory} .`); await tarAdd(localTarPath, localDirectory);
// Copy the compressed file to remoteDirectory and delete it // Copy the compressed file to remoteDirectory and delete it
await copyFileToRemote(localTarPath, remoteTarPath, sshClient); await copyFileToRemote(localTarPath, remoteTarPath, sshClient);
await cpp.exec(`rm ${localTarPath}`); await execRemove(localTarPath);
// Decompress the remote compressed file in and delete it // Decompress the remote compressed file in and delete it
await remoteExeCommand(`tar -oxzf ${remoteTarPath} -C ${remoteDirectory}`, sshClient); await remoteExeCommand(`tar -oxzf ${remoteTarPath} -C ${remoteDirectory}`, sshClient);
await remoteExeCommand(`rm ${remoteTarPath}`, sshClient); await remoteExeCommand(`rm ${remoteTarPath}`, sshClient);
......
jobs:
- job: 'integration_test_remote_windows'
steps:
- script: python -m pip install --upgrade pip setuptools
displayName: 'Install python tools'
- task: CopyFilesOverSSH@0
inputs:
sshEndpoint: $(end_point)
targetFolder: /tmp/nnitest/$(Build.BuildId)/nni-remote
overwrite: true
displayName: 'Copy all files to remote machine'
- script: |
powershell.exe -file install.ps1
displayName: 'Install nni toolkit via source code'
- script: |
python -m pip install scikit-learn==0.20.1 --user
displayName: 'Install dependencies for integration tests'
- task: SSH@0
inputs:
sshEndpoint: $(end_point)
runOptions: inline
inline: cd /tmp/nnitest/$(Build.BuildId)/nni-remote/deployment/pypi;make build
continueOnError: true
displayName: 'build nni bdist_wheel'
- task: SSH@0
inputs:
sshEndpoint: $(end_point)
runOptions: commands
commands: python3 /tmp/nnitest/$(Build.BuildId)/nni-remote/test/remote_docker.py --mode start --name $(Build.BuildId) --image nni/nni --os windows
displayName: 'Start docker'
- powershell: |
Write-Host "Downloading Putty..."
(New-Object Net.WebClient).DownloadFile("https://the.earth.li/~sgtatham/putty/latest/w64/pscp.exe", "$(Agent.TempDirectory)\pscp.exe")
$(Agent.TempDirectory)\pscp.exe -hostkey $(hostkey) -pw $(pscp_pwd) $(remote_user)@$(remote_host):/tmp/nnitest/$(Build.BuildId)/port test\port
Get-Content test\port
displayName: 'Get docker port'
- powershell: |
cd test
python generate_ts_config.py --ts remote --remote_user $(docker_user) --remote_host $(remote_host) --remote_port $(Get-Content port) --remote_pwd $(docker_pwd) --nni_manager_ip $(nni_manager_ip)
Get-Content training_service.yml
python config_test.py --ts remote --exclude cifar10,smac,bohb
displayName: 'integration test'
- task: SSH@0
inputs:
sshEndpoint: $(end_point)
runOptions: commands
commands: python3 /tmp/nnitest/$(Build.BuildId)/nni-remote/test/remote_docker.py --mode stop --name $(Build.BuildId) --os windows
displayName: 'Stop docker'
...@@ -30,18 +30,33 @@ def find_wheel_package(dir): ...@@ -30,18 +30,33 @@ def find_wheel_package(dir):
return file_name return file_name
return None return None
def start_container(image, name): def start_container(image, name, nnimanager_os):
'''Start docker container, generate a port in /tmp/nnitest/{name}/port file''' '''Start docker container, generate a port in /tmp/nnitest/{name}/port file'''
port = find_port() port = find_port()
source_dir = '/tmp/nnitest/' + name source_dir = '/tmp/nnitest/' + name
run_cmds = ['docker', 'run', '-d', '-p', str(port) + ':22', '--name', name, '--mount', 'type=bind,source=' + source_dir + ',target=/tmp/nni', image] run_cmds = ['docker', 'run', '-d', '-p', str(port) + ':22', '--name', name, '--mount', 'type=bind,source=' + source_dir + ',target=/tmp/nni', image]
output = check_output(run_cmds) output = check_output(run_cmds)
commit_id = output.decode('utf-8') commit_id = output.decode('utf-8')
wheel_name = find_wheel_package(os.path.join(source_dir, 'dist'))
if nnimanager_os == 'windows':
wheel_name = find_wheel_package(os.path.join(source_dir, 'nni-remote/deployment/pypi/dist'))
else:
wheel_name = find_wheel_package(os.path.join(source_dir, 'dist'))
if not wheel_name: if not wheel_name:
print('Error: could not find wheel package in {0}'.format(source_dir)) print('Error: could not find wheel package in {0}'.format(source_dir))
exit(1) exit(1)
sdk_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', '/tmp/nni/dist/{0}'.format(wheel_name)]
def get_dist(wheel_name):
'''get the wheel package path'''
if nnimanager_os == 'windows':
return '/tmp/nni/nni-remote/deployment/pypi/dist/{0}'.format(wheel_name)
else:
return '/tmp/nni/dist/{0}'.format(wheel_name)
pip_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', '--upgrade', 'pip']
check_call(pip_cmds)
sdk_cmds = ['docker', 'exec', name, 'python3', '-m', 'pip', 'install', get_dist(wheel_name)]
check_call(sdk_cmds) check_call(sdk_cmds)
with open(source_dir + '/port', 'w') as file: with open(source_dir + '/port', 'w') as file:
file.write(str(port)) file.write(str(port))
...@@ -58,8 +73,9 @@ if __name__ == '__main__': ...@@ -58,8 +73,9 @@ if __name__ == '__main__':
parser.add_argument('--mode', required=True, choices=['start', 'stop'], dest='mode', help='start or stop a container') parser.add_argument('--mode', required=True, choices=['start', 'stop'], dest='mode', help='start or stop a container')
parser.add_argument('--name', required=True, dest='name', help='the name of container to be used') parser.add_argument('--name', required=True, dest='name', help='the name of container to be used')
parser.add_argument('--image', dest='image', help='the image to be used') parser.add_argument('--image', dest='image', help='the image to be used')
parser.add_argument('--os', dest='os', default='unix', choices=['unix', 'windows'], help='nniManager os version')
args = parser.parse_args() args = parser.parse_args()
if args.mode == 'start': if args.mode == 'start':
start_container(args.image, args.name) start_container(args.image, args.name, args.os)
else: else:
stop_container(args.name) stop_container(args.name)
$NNI_DEPENDENCY_FOLDER = [System.IO.Path]::GetTempPath()+$env:USERNAME
$NNI_DEPENDENCY_FOLDER = "C:\tmp\$env:USERNAME"
$env:PYTHONIOENCODING = "UTF-8" $env:PYTHONIOENCODING = "UTF-8"
if($env:VIRTUAL_ENV){ if($env:VIRTUAL_ENV){
...@@ -27,4 +26,4 @@ Remove-Item "src/nni_manager/node_modules" -Recurse -Force ...@@ -27,4 +26,4 @@ Remove-Item "src/nni_manager/node_modules" -Recurse -Force
Remove-Item "src/webui/build" -Recurse -Force Remove-Item "src/webui/build" -Recurse -Force
Remove-Item "src/webui/node_modules" -Recurse -Force Remove-Item "src/webui/node_modules" -Recurse -Force
Remove-Item $NNI_YARN_FOLDER -Recurse -Force Remove-Item $NNI_YARN_FOLDER -Recurse -Force
Remove-Item $NNI_NODE_FOLDER -Recurse -Force Remove-Item $NNI_NODE_FOLDER -Recurse -Force
\ No newline at end of file