======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_cudnn_batchnorm_spatial_persistent', current_value=True, default_value=False)
FLAGS(name='FLAGS_selected_gpus', current_value='3', default_value='')
=======================================================================
I0727 11:19:01.148255 2863 tcp_utils.cc:107] Retry to connect to 127.0.0.1:52742 while the server is not yet listening.
I0727 11:19:04.148411 2863 tcp_utils.cc:130] Successfully connected to 127.0.0.1:52742
I0727 11:19:04.179085 2863 process_group_nccl.cc:120] ProcessGroupNCCL pg_timeout_ 1800000
W0727 11:19:04.198071 2863 gpu_resources.cc:119] Please NOTE: device: 3, GPU Compute Capability: 90.2, Driver API Version: 50724.2, Runtime API Version: 50724.2
eval model:: 0%| | 0/500 [00:00 >)

----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1722504077 (unix time) try "date -d @1722504077" if you are using GNU date ***]
  [SignalInfo: *** SIGTERM (@0x148) received by PID 416 (TID 0x7f116a10b640) from PID 328 ***]

======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='3', default_value='')
FLAGS(name='FLAGS_cudnn_batchnorm_spatial_persistent', current_value=True, default_value=False)
=======================================================================
I0801 17:43:09.886485 743 tcp_utils.cc:107] Retry to connect to 10.8.145.246:62911 while the server is not yet listening.
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   phi::distributed::CreateOrGetGlobalTCPStore()
1   phi::distributed::TCPStore::TCPStore(std::string, unsigned short, bool, unsigned long, int)
2   phi::distributed::detail::TCPClient::connect(std::string, unsigned short)
3   phi::distributed::tcputils::tcp_connect(std::string, std::string, int, std::chrono::duration >)

----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1722505391 (unix time) try "date -d @1722505391" if you are using GNU date ***]
  [SignalInfo: *** SIGTERM (@0x28f) received by PID 743 (TID 0x7f4b3bf58740) from PID 655 ***]

======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_cudnn_batchnorm_spatial_persistent', current_value=True, default_value=False)
FLAGS(name='FLAGS_selected_gpus', current_value='3', default_value='')
=======================================================================
I0801 17:45:05.406978 1065 tcp_utils.cc:107] Retry to connect to 10.8.145.246:49295 while the server is not yet listening.
I0801 17:45:08.407142 1065 tcp_utils.cc:130] Successfully connected to 10.8.145.246:49295
I0801 17:45:08.412381 1065 process_group_nccl.cc:120] ProcessGroupNCCL pg_timeout_ 1800000
W0801 17:45:08.423352 1065 gpu_resources.cc:119] Please NOTE: device: 3, GPU Compute Capability: 90.2, Driver API Version: 50724.2, Runtime API Version: 50724.2
eval model:: 0%| | 0/500 [00:00