fix the issue of tensorboard visualization
Summary:
Pull Request resolved: https://github.com/facebookresearch/d2go/pull/473

As shown in the attached image and the TensorBoard visualization, some of our jobs fail to save their results to TensorBoard. If the images had been added to TensorBoard, there would be log messages between the circled lines of the screenshot.

One possible reason is that the TensorBoard visualization evaluator is only added on the rank 0 GPU. That rank may fail to fetch any data during evaluation of a diffusion model, which runs only 1 batch of inference during validation.

To resolve this issue, we add the visualization evaluator on all ranks, gather their results, and add the result with the largest batch size to TensorBoard for visualization.

The screenshot is from f410204704 (https://www.internalfb.com/manifold/explorer/mobile_vision_workflows/tree/workflows/xutao/20230211/latest_train/dalle2_decoder.SIULDLpgix/e2e_train/log.txt)

Refactored default_runner.py to add a new function, _create_evaluators, that creates all evaluators. A runner that needs the visualization evaluator on all ranks therefore no longer has to override the whole _do_test function.

(Note: this ignores all push blocking failures!)

Reviewed By: YanjunChen329

Differential Revision: D43263543

fbshipit-source-id: eca2259277584819dcc5400d47fa4fb142f2ed9b
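
The gather-and-select step described in the summary can be illustrated with a minimal sketch. This is not the code from the pull request: the helper name `select_largest_batch` and the sample data are hypothetical, and the cross-rank gather is simulated with a plain list so the example runs without a distributed setup. In a real job the per-rank list would presumably come from a collective such as `detectron2.utils.comm.gather`.

```python
from typing import Optional, Sequence


def select_largest_batch(per_rank: Sequence[Optional[list]]) -> Optional[list]:
    """Return the per-rank result that holds the most items.

    `per_rank` has one entry per rank; a rank that saw no data
    contributes an empty list or None. Returns None if every rank is empty.
    (Hypothetical helper, for illustration only.)
    """
    non_empty = [result for result in per_rank if result]
    if not non_empty:
        return None
    # max() returns the first entry with the greatest length on ties.
    return max(non_empty, key=len)


# Simulated gather output for 4 ranks: rank 0 fetched nothing,
# which is the failure mode described in the summary.
gathered = [[], ["img_a", "img_b"], ["img_c"], None]
best = select_largest_batch(gathered)
```

With this selection, rank 0 can still log images when its own evaluator received no batch, as long as at least one other rank did.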