Unverified Commit 7246593f authored by chicm-ms's avatar chicm-ms Committed by GitHub
Browse files

Fix multiphase issue with gridsearch tuner and batch tuner (#1539)

* Fix multiphase issue with gridsearch tuner and batch tuner
parent 41e58703
......@@ -8,8 +8,6 @@ Typically each trial job gets a single configuration (e.g., hyperparameters) fro
The above cases can be supported by the same feature, i.e., multi-phase execution. To support those cases, basically a trial job should be able to request multiple configurations from tuner. Tuner is aware of whether two configuration requests are from the same trial job or different ones. Also in multi-phase a trial job can report multiple final results.
Note that, `nni.get_next_parameter()` and `nni.report_final_result()` should be called sequentially: __call the former one, then call the later one; and repeat this pattern__. If `nni.get_next_parameter()` is called multiple times consecutively, and then `nni.report_final_result()` is called once, the result is associated to the last configuration, which is retrieved from the last get_next_parameter call. So there is no result associated to previous get_next_parameter calls, and it may cause some multi-phase algorithm broken.
## Create multi-phase experiment
### Write trial code which leverages multi-phase:
......@@ -23,6 +21,9 @@ It is pretty simple to use multi-phase in trial code, an example is shown below:
for i in range(5):
# get parameter from tuner
tuner_param = nni.get_next_parameter()
# nni.get_next_parameter returns None if there is no more hyper parameters can be generated by tuner.
if tuner_param is None:
break
# consume the params
# ...
......@@ -32,6 +33,10 @@ It is pretty simple to use multi-phase in trial code, an example is shown below:
# ...
```
In multi-phase experiments, at each time the API ```nni.get_next_parameter()``` is called, it returns a new hyper parameter generated by tuner, then the trail code consume this new hyper parameter and report final result of this hyper parameter. `nni.get_next_parameter()` and `nni.report_final_result()` should be called sequentially: __call the former one, then call the later one; and repeat this pattern__. If `nni.get_next_parameter()` is called multiple times consecutively, and then `nni.report_final_result()` is called once, the result is associated to the last configuration, which is retrieved from the last get_next_parameter call. So there is no result associated to previous get_next_parameter calls, and it may cause some multi-phase algorithm broken.
Note that, ```nni.get_next_parameter``` returns None if there is no more hyper parameters can be generated by tuner.
__2. Experiment configuration__
To enable multi-phase, you should also add `multiPhase: true` in your experiment YAML configure file. If this line is not added, `nni.get_next_parameter()` would always return the same configuration.
......
......@@ -691,8 +691,11 @@ class NNIManager implements Manager {
};
this.log.info(`updateTrialJob: job id: ${tunerCommand.trial_job_id}, form: ${JSON.stringify(trialJobForm)}`);
await this.trainingService.updateTrialJob(tunerCommand.trial_job_id, trialJobForm);
if (tunerCommand['parameters'] !== null) {
// parameters field is set as empty string if no more hyper parameter can be generated by tuner.
await this.dataStore.storeTrialJobEvent(
'ADD_HYPERPARAMETER', tunerCommand.trial_job_id, content, undefined);
}
break;
case NO_MORE_TRIAL_JOBS:
if (!['ERROR', 'STOPPING', 'STOPPED'].includes(this.status.status)) {
......
......@@ -22,6 +22,7 @@ import logging
from collections import defaultdict
import json_tricks
from nni import NoMoreTrialError
from .protocol import CommandType, send
from .msg_dispatcher_base import MsgDispatcherBase
from .assessor import AssessResult
......@@ -144,7 +145,10 @@ class MsgDispatcher(MsgDispatcherBase):
assert data['trial_job_id'] is not None
assert data['parameter_index'] is not None
param_id = _create_parameter_id()
try:
param = self.tuner.generate_parameters(param_id, trial_job_id=data['trial_job_id'])
except NoMoreTrialError:
param = None
send(CommandType.SendTrialJobParameter, _pack_parameter(param_id, param, trial_job_id=data['trial_job_id'], parameter_index=data['parameter_index']))
else:
raise ValueError('Data type not supported: {}'.format(data['type']))
......
......@@ -43,7 +43,8 @@ _sequence_id = platform.get_sequence_id()
def get_next_parameter():
"""Returns a set of (hyper-)paremeters generated by Tuner."""
"""Returns a set of (hyper-)paremeters generated by Tuner.
Returns None if no more (hyper-)parameters can be generated by Tuner."""
global _params
_params = platform.get_next_parameter()
if _params is None:
......
......@@ -4,5 +4,8 @@ import nni
if __name__ == '__main__':
for i in range(5):
hyper_params = nni.get_next_parameter()
print('hyper_params:[{}]'.format(hyper_params))
if hyper_params is None:
break
nni.report_final_result(0.1*i)
time.sleep(3)
......@@ -18,7 +18,7 @@ jobs:
displayName: 'generate config files'
- script: |
cd test
python config_test.py --ts local --local_gpu --exclude smac,bohb,multi_phase_batch,multi_phase_grid
python config_test.py --ts local --local_gpu --exclude smac,bohb
displayName: 'Examples and advanced features tests on local machine'
- script: |
cd test
......
......@@ -31,7 +31,7 @@ jobs:
displayName: 'Built-in tuners / assessors tests'
- script: |
cd test
PATH=$HOME/.local/bin:$PATH python3 config_test.py --ts local --local_gpu --exclude multi_phase_batch,multi_phase_grid
PATH=$HOME/.local/bin:$PATH python3 config_test.py --ts local --local_gpu
displayName: 'Examples and advanced features tests on local machine'
- script: |
cd test
......
......@@ -65,5 +65,5 @@ jobs:
python --version
python generate_ts_config.py --ts pai --pai_host $(pai_host) --pai_user $(pai_user) --pai_pwd $(pai_pwd) --vc $(pai_virtual_cluster) --nni_docker_image $(docker_image) --data_dir $(data_dir) --output_dir $(output_dir) --nni_manager_ip $(nni_manager_ip)
python config_test.py --ts pai --exclude multi_phase,smac,bohb,multi_phase_batch,multi_phase_grid
python config_test.py --ts pai --exclude multi_phase,smac,bohb
displayName: 'Examples and advanced features tests on pai'
\ No newline at end of file
......@@ -76,6 +76,6 @@ jobs:
python3 generate_ts_config.py --ts pai --pai_host $(pai_host) --pai_user $(pai_user) --pai_pwd $(pai_pwd) --vc $(pai_virtual_cluster) \
--nni_docker_image $TEST_IMG --data_dir $(data_dir) --output_dir $(output_dir) --nni_manager_ip $(nni_manager_ip)
PATH=$HOME/.local/bin:$PATH python3 config_test.py --ts pai --exclude multi_phase_batch,multi_phase_grid
PATH=$HOME/.local/bin:$PATH python3 config_test.py --ts pai
PATH=$HOME/.local/bin:$PATH python3 metrics_test.py
displayName: 'integration test'
......@@ -39,7 +39,7 @@ jobs:
cd test
python generate_ts_config.py --ts remote --remote_user $(docker_user) --remote_host $(remote_host) --remote_port $(Get-Content port) --remote_pwd $(docker_pwd) --nni_manager_ip $(nni_manager_ip)
Get-Content training_service.yml
python config_test.py --ts remote --exclude cifar10,smac,bohb,multi_phase_batch,multi_phase_grid
python config_test.py --ts remote --exclude cifar10,smac,bohb
displayName: 'integration test'
- task: SSH@0
inputs:
......
......@@ -53,7 +53,7 @@ jobs:
python3 generate_ts_config.py --ts remote --remote_user $(docker_user) --remote_host $(remote_host) \
--remote_port $(cat port) --remote_pwd $(docker_pwd) --nni_manager_ip $(nni_manager_ip)
cat training_service.yml
PATH=$HOME/.local/bin:$PATH python3 config_test.py --ts remote --exclude cifar10,multi_phase_batch,multi_phase_grid
PATH=$HOME/.local/bin:$PATH python3 config_test.py --ts remote --exclude cifar10
PATH=$HOME/.local/bin:$PATH python3 metrics_test.py
displayName: 'integration test'
- task: SSH@0
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment