2024-12-16 10:10:34,054 INFO MainThread:1952 [wandb_setup.py:_flush():68] Current SDK version is 0.19.1
2024-12-16 10:10:34,054 INFO MainThread:1952 [wandb_setup.py:_flush():68] Configure stats pid to 1952
2024-12-16 10:10:34,054 INFO MainThread:1952 [wandb_setup.py:_flush():68] Loading settings from /root/.config/wandb/settings
2024-12-16 10:10:34,054 INFO MainThread:1952 [wandb_setup.py:_flush():68] Loading settings from /mnt/xgen-mm/LAVIS/wandb/settings
2024-12-16 10:10:34,054 INFO MainThread:1952 [wandb_setup.py:_flush():68] Loading settings from environment variables
2024-12-16 10:10:34,054 INFO MainThread:1952 [wandb_init.py:_log_setup():528] Logging user logs to /mnt/xgen-mm/LAVIS/wandb/run-20241216_101034-bluz2d3p/logs/debug.log
2024-12-16 10:10:34,054 INFO MainThread:1952 [wandb_init.py:_log_setup():529] Logging internal logs to /mnt/xgen-mm/LAVIS/wandb/run-20241216_101034-bluz2d3p/logs/debug-internal.log
2024-12-16 10:10:34,054 INFO MainThread:1952 [wandb_init.py:init():644] calling init triggers
2024-12-16 10:10:34,054 INFO MainThread:1952 [wandb_init.py:init():650] wandb.init called with sweep_config: {}
config: {'model_family': 'xgenmm_v1', 'vision_encoder_path': 'google/siglip-so400m-patch14-384', 'vision_encoder_pretrained': 'google', 'lm_path': 'microsoft/Phi-3-mini-4k-instruct', 'tokenizer_path': 'microsoft/Phi-3-mini-4k-instruct', 'cross_attn_every_n_layers': 1, 'num_vision_tokens': 128, 'pretrained': '/mnt/xgen-mm/xgen-mm-phi3-mini-base-r-v1.5.pt', 'pretrained_vision_tokenizer': None, 'loss': 'supervised_finetune', 'run_name': 'finetune-xgenmmv1-phi3_4k_instruct-example_data_config', 'resume_from_checkpoint': None, 'delete_previous_checkpoint': False, 'no_save_optim_state': True, 'gradient_accumulation_steps': 1, 'seed': 42, 'learning_rate': 2e-05, 'lr_scheduler': 'cosine', 'warmup_steps': 2000, 'weight_decay': 0.0, 'precision': 'amp_bf16', 'gradient_checkpointing': True, 'num_epochs': 1, 'offline': False, 'logging_steps': 100, 'checkpoint_steps': 5000, 'data_path': '/mnt/xgen-mm/LAVIS/data_configs/example_data_config.yaml', 'batch_size': 8, 'workers': 4, 'data_sampler_group_by_length': True, 'is_multimodal': True, 'mm_use_im_start_end': False, 'conv_template_name': 'phi_3', 'image_aspect_ratio': 'anyres', 'anyres_patch_sampling': True, 'anyres_grids': [(1, 2), (2, 1), (2, 2), (3, 1), (1, 3)], 'dist_url': 'env://', 'dist_backend': 'nccl', 'horovod': False, 'no_set_device_rank': False, 'local_rank': 0, 'fsdp': True, 'fsdp_sharding_strategy': 'hybrid', 'report_to_wandb': True, 'wandb_project': 'blip3-xgenmm-finetune', 'wandb_entity': None, 'save_checkpoints_to_wandb': False, 'dryrun': False, 'use_flash_attention_2': False, 'unfreeze_vision_encoder': False, 'vision_encoder_precision': 'fp32', 'cpu_offload_gradients': False, 'rank': 0, 'world_size': 8, 'distributed': True, 'device': 'cuda:0'}
2024-12-16 10:10:34,054 INFO MainThread:1952 [wandb_init.py:init():680] starting backend
2024-12-16 10:10:34,054 INFO MainThread:1952 [wandb_init.py:init():684] sending inform_init request
2024-12-16 10:10:34,058 INFO MainThread:1952 [backend.py:_multiprocessing_setup():104] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-12-16 10:10:34,058 INFO MainThread:1952 [wandb_init.py:init():697] backend started and connected
2024-12-16 10:10:34,061 INFO MainThread:1952 [wandb_init.py:init():790] updated telemetry
2024-12-16 10:10:34,068 INFO MainThread:1952 [wandb_init.py:init():822] communicating run to backend with 90.0 second timeout
2024-12-16 10:11:02,254 INFO Thread-2 (wrapped_target):1952 [retry.py:__call__():172] Retry attempt failed:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 199, in _new_conn
    sock = connection.create_connection(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/local/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 789, in urlopen
    response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 490, in _make_request
    raise new_e
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
    self._validate_conn(conn)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
    conn.connect()
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 693, in connect
    self.sock = sock = self._new_conn()
  File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 206, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: : Failed to resolve 'api.wandb.ai' ([Errno -3] Temporary failure in name resolution)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 843, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 519, in increment
    raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by NameResolutionError(": Failed to resolve 'api.wandb.ai' ([Errno -3] Temporary failure in name resolution)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/wandb/sdk/lib/retry.py", line 131, in __call__
    result = self._call_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/wandb/sdk/internal/internal_api.py", line 393, in execute
    return self.client.execute(*args, **kwargs) # type: ignore
  File "/usr/local/lib/python3.10/site-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py", line 52, in execute
    result = self._get_result(document, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py", line 60, in _get_result
    return self.transport.execute(document, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/wandb/sdk/lib/gql_request.py", line 58, in execute
    request = self.session.post(self.url, **post_args)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 637, in post
    return self.request("POST", url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 700, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by NameResolutionError(": Failed to resolve 'api.wandb.ai' ([Errno -3] Temporary failure in name resolution)"))
2024-12-16 10:12:04,084 ERROR MainThread:1952 [wandb_init.py:init():849] encountered error: Run initialization has timed out after 90.0 sec. Please try increasing the timeout with the `init_timeout` setting: `wandb.init(settings=wandb.Settings(init_timeout=120))`.
2024-12-16 10:12:04,085 ERROR MainThread:1952 [wandb_init.py:init():1308] error in wandb.init()
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1298, in init
    return wi.init()
  File "/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 855, in init
    raise error
wandb.errors.errors.CommError: Run initialization has timed out after 90.0 sec. Please try increasing the timeout with the `init_timeout` setting: `wandb.init(settings=wandb.Settings(init_timeout=120))`.
2024-12-16 10:12:04,091 WARNING MsgRouterThr:1952 [router.py:message_loop():75] message_loop has been closed