Commit 1b9205c9 authored by yangzhong

v1.0

Pipeline #2931 failed with stages in 0 seconds
/root/.cache/wandb/logs/core-debug-20241216_104829.log
\ No newline at end of file
{"time":"2024-12-16T10:48:34.438350079+08:00","level":"INFO","msg":"using version","core version":"0.19.1"}
{"time":"2024-12-16T10:48:34.438360599+08:00","level":"INFO","msg":"created symlink","path":"/mnt/xgen-mm/LAVIS/wandb/run-20241216_104834-iey8t0re/logs/debug-core.log"}
{"time":"2024-12-16T10:48:34.548423887+08:00","level":"INFO","msg":"created new stream","id":"iey8t0re"}
{"time":"2024-12-16T10:48:34.548468636+08:00","level":"INFO","msg":"stream: started","id":"iey8t0re"}
{"time":"2024-12-16T10:48:34.548487791+08:00","level":"INFO","msg":"writer: Do: started","stream_id":"iey8t0re"}
{"time":"2024-12-16T10:48:34.548495322+08:00","level":"INFO","msg":"sender: started","stream_id":"iey8t0re"}
{"time":"2024-12-16T10:48:34.548523119+08:00","level":"INFO","msg":"handler: started","stream_id":"iey8t0re"}
{"time":"2024-12-16T10:48:44.551383213+08:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/graphql\": dial tcp: lookup api.wandb.ai on 114.114.114.114:53: read udp 10.16.4.6:37994->114.114.114.114:53: i/o timeout"}
{"time":"2024-12-16T10:48:56.609144292+08:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/graphql\": dial tcp: lookup api.wandb.ai on 114.114.114.114:53: read udp 10.16.4.6:36384->114.114.114.114:53: i/o timeout"}
{"time":"2024-12-16T10:49:10.850134383+08:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/graphql\": dial tcp: lookup api.wandb.ai on 114.114.114.114:53: read udp 10.16.4.6:45542->114.114.114.114:53: i/o timeout"}
{"time":"2024-12-16T10:49:28.967945371+08:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/graphql\": dial tcp: lookup api.wandb.ai on 114.114.114.114:53: read udp 10.16.4.6:43370->114.114.114.114:53: i/o timeout"}
{"time":"2024-12-16T10:49:57.438447961+08:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/graphql\": dial tcp: lookup api.wandb.ai on 114.114.114.114:53: read udp 10.16.4.6:33213->114.114.114.114:53: i/o timeout"}
{"time":"2024-12-16T10:50:04.461575809+08:00","level":"INFO","msg":"stream: closing","id":"iey8t0re"}
{"time":"2024-12-16T10:50:04.461655516+08:00","level":"ERROR","msg":"sender: upsertRun:","error":"failed to upsert bucket: api: failed sending: context canceled"}
{"time":"2024-12-16T10:50:04.461679932+08:00","level":"INFO","msg":"handler: closed","stream_id":"iey8t0re"}
{"time":"2024-12-16T10:50:04.461690188+08:00","level":"INFO","msg":"writer: Close: closed","stream_id":"iey8t0re"}
{"time":"2024-12-16T10:50:04.461739964+08:00","level":"WARN","msg":"runwork: ignoring record after close","work":{"Record":{"RecordType":{"Request":{"RequestType":{"Defer":{}}}},"control":{"always_send":true}}}}
{"time":"2024-12-16T10:50:04.461870979+08:00","level":"INFO","msg":"sender: closed","stream_id":"iey8t0re"}
{"time":"2024-12-16T10:50:04.462185417+08:00","level":"INFO","msg":"stream: closed","id":"iey8t0re"}
2024-12-16 10:48:34,432 INFO MainThread:6275 [wandb_setup.py:_flush():68] Current SDK version is 0.19.1
2024-12-16 10:48:34,432 INFO MainThread:6275 [wandb_setup.py:_flush():68] Configure stats pid to 6275
2024-12-16 10:48:34,432 INFO MainThread:6275 [wandb_setup.py:_flush():68] Loading settings from /root/.config/wandb/settings
2024-12-16 10:48:34,432 INFO MainThread:6275 [wandb_setup.py:_flush():68] Loading settings from /mnt/xgen-mm/LAVIS/wandb/settings
2024-12-16 10:48:34,432 INFO MainThread:6275 [wandb_setup.py:_flush():68] Loading settings from environment variables
2024-12-16 10:48:34,432 INFO MainThread:6275 [wandb_init.py:_log_setup():528] Logging user logs to /mnt/xgen-mm/LAVIS/wandb/run-20241216_104834-iey8t0re/logs/debug.log
2024-12-16 10:48:34,432 INFO MainThread:6275 [wandb_init.py:_log_setup():529] Logging internal logs to /mnt/xgen-mm/LAVIS/wandb/run-20241216_104834-iey8t0re/logs/debug-internal.log
2024-12-16 10:48:34,433 INFO MainThread:6275 [wandb_init.py:init():644] calling init triggers
2024-12-16 10:48:34,433 INFO MainThread:6275 [wandb_init.py:init():650] wandb.init called with sweep_config: {}
config: {'model_family': 'xgenmm_v1', 'vision_encoder_path': 'google/siglip-so400m-patch14-384', 'vision_encoder_pretrained': 'google', 'lm_path': 'microsoft/Phi-3-mini-4k-instruct', 'tokenizer_path': 'microsoft/Phi-3-mini-4k-instruct', 'cross_attn_every_n_layers': 1, 'num_vision_tokens': 128, 'pretrained': '/mnt/xgen-mm/xgen-mm-phi3-mini-base-r-v1.5.pt', 'pretrained_vision_tokenizer': None, 'loss': 'supervised_finetune', 'run_name': 'finetune-xgenmmv1-phi3_4k_instruct-example_data_config', 'resume_from_checkpoint': None, 'delete_previous_checkpoint': False, 'no_save_optim_state': True, 'gradient_accumulation_steps': 1, 'seed': 42, 'learning_rate': 2e-05, 'lr_scheduler': 'cosine', 'warmup_steps': 2000, 'weight_decay': 0.0, 'precision': 'amp_bf16', 'gradient_checkpointing': True, 'num_epochs': 1, 'offline': False, 'logging_steps': 100, 'checkpoint_steps': 5000, 'data_path': '/mnt/xgen-mm/LAVIS/data_configs/example_data_config.yaml', 'batch_size': 8, 'workers': 4, 'data_sampler_group_by_length': True, 'is_multimodal': True, 'mm_use_im_start_end': False, 'conv_template_name': 'phi_3', 'image_aspect_ratio': 'anyres', 'anyres_patch_sampling': True, 'anyres_grids': [(1, 2), (2, 1), (2, 2), (3, 1), (1, 3)], 'dist_url': 'env://', 'dist_backend': 'nccl', 'horovod': False, 'no_set_device_rank': False, 'local_rank': 0, 'fsdp': True, 'fsdp_sharding_strategy': 'hybrid', 'report_to_wandb': True, 'wandb_project': 'blip3-xgenmm-finetune', 'wandb_entity': None, 'save_checkpoints_to_wandb': False, 'dryrun': False, 'use_flash_attention_2': False, 'unfreeze_vision_encoder': False, 'vision_encoder_precision': 'fp32', 'cpu_offload_gradients': False, 'rank': 0, 'world_size': 8, 'distributed': True, 'device': 'cuda:0'}
2024-12-16 10:48:34,433 INFO MainThread:6275 [wandb_init.py:init():680] starting backend
2024-12-16 10:48:34,433 INFO MainThread:6275 [wandb_init.py:init():684] sending inform_init request
2024-12-16 10:48:34,436 INFO MainThread:6275 [backend.py:_multiprocessing_setup():104] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-12-16 10:48:34,437 INFO MainThread:6275 [wandb_init.py:init():697] backend started and connected
2024-12-16 10:48:34,438 INFO MainThread:6275 [wandb_init.py:init():790] updated telemetry
2024-12-16 10:48:34,444 INFO MainThread:6275 [wandb_init.py:init():822] communicating run to backend with 90.0 second timeout
2024-12-16 10:49:02,634 INFO Thread-2 (wrapped_target):6275 [retry.py:__call__():172] Retry attempt failed:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 199, in _new_conn
sock = connection.create_connection(
File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/local/lib/python3.10/socket.py", line 955, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 789, in urlopen
response = self._make_request(
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 490, in _make_request
raise new_e
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
conn.connect()
File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 693, in connect
self.sock = sock = self._new_conn()
File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 206, in _new_conn
raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f01ff1195d0>: Failed to resolve 'api.wandb.ai' ([Errno -3] Temporary failure in name resolution)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 667, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 843, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 519, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f01ff1195d0>: Failed to resolve 'api.wandb.ai' ([Errno -3] Temporary failure in name resolution)"))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/wandb/sdk/lib/retry.py", line 131, in __call__
result = self._call_fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/wandb/sdk/internal/internal_api.py", line 393, in execute
return self.client.execute(*args, **kwargs) # type: ignore
File "/usr/local/lib/python3.10/site-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py", line 52, in execute
result = self._get_result(document, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py", line 60, in _get_result
return self.transport.execute(document, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/wandb/sdk/lib/gql_request.py", line 58, in execute
request = self.session.post(self.url, **post_args)
File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 637, in post
return self.request("POST", url, data=data, json=json, **kwargs)
File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 700, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f01ff1195d0>: Failed to resolve 'api.wandb.ai' ([Errno -3] Temporary failure in name resolution)"))
2024-12-16 10:50:04,455 ERROR MainThread:6275 [wandb_init.py:init():849] encountered error: Run initialization has timed out after 90.0 sec. Please try increasing the timeout with the `init_timeout` setting: `wandb.init(settings=wandb.Settings(init_timeout=120))`.
2024-12-16 10:50:04,456 ERROR MainThread:6275 [wandb_init.py:init():1308] error in wandb.init()
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1298, in init
return wi.init()
File "/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 855, in init
raise error
wandb.errors.errors.CommError: Run initialization has timed out after 90.0 sec. Please try increasing the timeout with the `init_timeout` setting: `wandb.init(settings=wandb.Settings(init_timeout=120))`.
2024-12-16 10:50:04,461 WARNING MsgRouterThr:6275 [router.py:message_loop():75] message_loop has been closed
/root/.cache/wandb/logs/core-debug-20241216_112319.log
\ No newline at end of file
{"time":"2024-12-16T11:23:24.302100605+08:00","level":"INFO","msg":"using version","core version":"0.19.1"}
{"time":"2024-12-16T11:23:24.302110279+08:00","level":"INFO","msg":"created symlink","path":"/mnt/xgen-mm/LAVIS/wandb/run-20241216_112324-4ze48ky3/logs/debug-core.log"}
{"time":"2024-12-16T11:23:24.41159067+08:00","level":"INFO","msg":"created new stream","id":"4ze48ky3"}
{"time":"2024-12-16T11:23:24.411647739+08:00","level":"INFO","msg":"stream: started","id":"4ze48ky3"}
{"time":"2024-12-16T11:23:24.41166882+08:00","level":"INFO","msg":"handler: started","stream_id":"4ze48ky3"}
{"time":"2024-12-16T11:23:24.411670642+08:00","level":"INFO","msg":"writer: Do: started","stream_id":"4ze48ky3"}
{"time":"2024-12-16T11:23:24.411672074+08:00","level":"INFO","msg":"sender: started","stream_id":"4ze48ky3"}
{"time":"2024-12-16T11:23:34.413993218+08:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/graphql\": dial tcp: lookup api.wandb.ai on 114.114.114.114:53: read udp 10.16.4.6:57208->114.114.114.114:53: i/o timeout"}
{"time":"2024-12-16T11:23:46.44082274+08:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/graphql\": dial tcp: lookup api.wandb.ai on 114.114.114.114:53: read udp 10.16.4.6:51728->114.114.114.114:53: i/o timeout"}
{"time":"2024-12-16T11:24:00.855908789+08:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/graphql\": dial tcp: lookup api.wandb.ai on 114.114.114.114:53: read udp 10.16.4.6:38629->114.114.114.114:53: i/o timeout"}
{"time":"2024-12-16T11:24:19.253955419+08:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/graphql\": dial tcp: lookup api.wandb.ai on 114.114.114.114:53: read udp 10.16.4.6:47267->114.114.114.114:53: i/o timeout"}
{"time":"2024-12-16T11:24:48.012929792+08:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/graphql\": dial tcp: lookup api.wandb.ai on 114.114.114.114:53: read udp 10.16.4.6:58083->114.114.114.114:53: i/o timeout"}
{"time":"2024-12-16T11:25:24.335768305+08:00","level":"INFO","msg":"stream: closing","id":"4ze48ky3"}
{"time":"2024-12-16T11:25:24.335812019+08:00","level":"INFO","msg":"handler: closed","stream_id":"4ze48ky3"}
{"time":"2024-12-16T11:25:24.335836815+08:00","level":"INFO","msg":"writer: Close: closed","stream_id":"4ze48ky3"}
{"time":"2024-12-16T11:25:24.335872709+08:00","level":"ERROR","msg":"sender: upsertRun:","error":"failed to upsert bucket: api: failed sending: POST https://api.wandb.ai/graphql giving up after 6 attempt(s): context canceled"}
{"time":"2024-12-16T11:25:24.335902062+08:00","level":"WARN","msg":"runwork: ignoring record after close","work":{"Record":{"RecordType":{"Request":{"RequestType":{"Defer":{}}}},"control":{"always_send":true}}}}
{"time":"2024-12-16T11:25:24.336008532+08:00","level":"INFO","msg":"sender: closed","stream_id":"4ze48ky3"}
{"time":"2024-12-16T11:25:24.336226132+08:00","level":"INFO","msg":"stream: closed","id":"4ze48ky3"}
2024-12-16 11:23:24,296 INFO MainThread:8851 [wandb_setup.py:_flush():68] Current SDK version is 0.19.1
2024-12-16 11:23:24,296 INFO MainThread:8851 [wandb_setup.py:_flush():68] Configure stats pid to 8851
2024-12-16 11:23:24,296 INFO MainThread:8851 [wandb_setup.py:_flush():68] Loading settings from /root/.config/wandb/settings
2024-12-16 11:23:24,296 INFO MainThread:8851 [wandb_setup.py:_flush():68] Loading settings from /mnt/xgen-mm/LAVIS/wandb/settings
2024-12-16 11:23:24,296 INFO MainThread:8851 [wandb_setup.py:_flush():68] Loading settings from environment variables
2024-12-16 11:23:24,296 INFO MainThread:8851 [wandb_init.py:_log_setup():528] Logging user logs to /mnt/xgen-mm/LAVIS/wandb/run-20241216_112324-4ze48ky3/logs/debug.log
2024-12-16 11:23:24,296 INFO MainThread:8851 [wandb_init.py:_log_setup():529] Logging internal logs to /mnt/xgen-mm/LAVIS/wandb/run-20241216_112324-4ze48ky3/logs/debug-internal.log
2024-12-16 11:23:24,296 INFO MainThread:8851 [wandb_init.py:init():644] calling init triggers
2024-12-16 11:23:24,296 INFO MainThread:8851 [wandb_init.py:init():650] wandb.init called with sweep_config: {}
config: {'model_family': 'xgenmm_v1', 'vision_encoder_path': 'google/siglip-so400m-patch14-384', 'vision_encoder_pretrained': 'google', 'lm_path': 'microsoft/Phi-3-mini-4k-instruct', 'tokenizer_path': 'microsoft/Phi-3-mini-4k-instruct', 'cross_attn_every_n_layers': 1, 'num_vision_tokens': 128, 'pretrained': '/mnt/xgen-mm/xgen-mm-phi3-mini-base-r-v1.5.pt', 'pretrained_vision_tokenizer': None, 'loss': 'supervised_finetune', 'run_name': 'finetune-xgenmmv1-phi3_4k_instruct-example_data_config', 'resume_from_checkpoint': None, 'delete_previous_checkpoint': False, 'no_save_optim_state': True, 'gradient_accumulation_steps': 1, 'seed': 42, 'learning_rate': 2e-05, 'lr_scheduler': 'cosine', 'warmup_steps': 2000, 'weight_decay': 0.0, 'precision': 'amp_bf16', 'gradient_checkpointing': True, 'num_epochs': 1, 'offline': False, 'logging_steps': 100, 'checkpoint_steps': 5000, 'data_path': '/mnt/xgen-mm/LAVIS/data_configs/example_data_config.yaml', 'batch_size': 8, 'workers': 4, 'data_sampler_group_by_length': True, 'is_multimodal': True, 'mm_use_im_start_end': False, 'conv_template_name': 'phi_3', 'image_aspect_ratio': 'anyres', 'anyres_patch_sampling': True, 'anyres_grids': [(1, 2), (2, 1), (2, 2), (3, 1), (1, 3)], 'dist_url': 'env://', 'dist_backend': 'nccl', 'horovod': False, 'no_set_device_rank': False, 'local_rank': 0, 'fsdp': True, 'fsdp_sharding_strategy': 'hybrid', 'report_to_wandb': True, 'wandb_project': 'blip3-xgenmm-finetune', 'wandb_entity': None, 'save_checkpoints_to_wandb': False, 'dryrun': False, 'use_flash_attention_2': False, 'unfreeze_vision_encoder': False, 'vision_encoder_precision': 'fp32', 'cpu_offload_gradients': False, 'rank': 0, 'world_size': 8, 'distributed': True, 'device': 'cuda:0'}
2024-12-16 11:23:24,296 INFO MainThread:8851 [wandb_init.py:init():680] starting backend
2024-12-16 11:23:24,296 INFO MainThread:8851 [wandb_init.py:init():684] sending inform_init request
2024-12-16 11:23:24,300 INFO MainThread:8851 [backend.py:_multiprocessing_setup():104] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-12-16 11:23:24,300 INFO MainThread:8851 [wandb_init.py:init():697] backend started and connected
2024-12-16 11:23:24,303 INFO MainThread:8851 [wandb_init.py:init():790] updated telemetry
2024-12-16 11:23:24,309 INFO MainThread:8851 [wandb_init.py:init():822] communicating run to backend with 120.0 second timeout
2024-12-16 11:23:52,498 INFO Thread-2 (wrapped_target):8851 [retry.py:__call__():172] Retry attempt failed:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 199, in _new_conn
sock = connection.create_connection(
File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/local/lib/python3.10/socket.py", line 955, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 789, in urlopen
response = self._make_request(
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 490, in _make_request
raise new_e
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
conn.connect()
File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 693, in connect
self.sock = sock = self._new_conn()
File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 206, in _new_conn
raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f4f5a5315a0>: Failed to resolve 'api.wandb.ai' ([Errno -3] Temporary failure in name resolution)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 667, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 843, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 519, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f4f5a5315a0>: Failed to resolve 'api.wandb.ai' ([Errno -3] Temporary failure in name resolution)"))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/wandb/sdk/lib/retry.py", line 131, in __call__
result = self._call_fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/wandb/sdk/internal/internal_api.py", line 393, in execute
return self.client.execute(*args, **kwargs) # type: ignore
File "/usr/local/lib/python3.10/site-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py", line 52, in execute
result = self._get_result(document, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py", line 60, in _get_result
return self.transport.execute(document, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/wandb/sdk/lib/gql_request.py", line 58, in execute
request = self.session.post(self.url, **post_args)
File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 637, in post
return self.request("POST", url, data=data, json=json, **kwargs)
File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 700, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f4f5a5315a0>: Failed to resolve 'api.wandb.ai' ([Errno -3] Temporary failure in name resolution)"))
2024-12-16 11:25:24,328 ERROR MainThread:8851 [wandb_init.py:init():849] encountered error: Run initialization has timed out after 120.0 sec. Please try increasing the timeout with the `init_timeout` setting: `wandb.init(settings=wandb.Settings(init_timeout=120))`.
2024-12-16 11:25:24,329 ERROR MainThread:8851 [wandb_init.py:init():1308] error in wandb.init()
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1298, in init
return wi.init()
File "/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 855, in init
raise error
wandb.errors.errors.CommError: Run initialization has timed out after 120.0 sec. Please try increasing the timeout with the `init_timeout` setting: `wandb.init(settings=wandb.Settings(init_timeout=120))`.
2024-12-16 11:25:24,335 WARNING MsgRouterThr:8851 [router.py:message_loop():75] message_loop has been closed
/root/.cache/wandb/logs/core-debug-20241216_113415.log
\ No newline at end of file
{"time":"2024-12-16T11:34:20.593727629+08:00","level":"INFO","msg":"using version","core version":"0.19.1"}
{"time":"2024-12-16T11:34:20.593737734+08:00","level":"INFO","msg":"created symlink","path":"/mnt/xgen-mm/LAVIS/wandb/run-20241216_113420-xvp9nqy9/logs/debug-core.log"}
{"time":"2024-12-16T11:34:20.703170654+08:00","level":"INFO","msg":"created new stream","id":"xvp9nqy9"}
{"time":"2024-12-16T11:34:20.703212891+08:00","level":"INFO","msg":"stream: started","id":"xvp9nqy9"}
{"time":"2024-12-16T11:34:20.703229602+08:00","level":"INFO","msg":"handler: started","stream_id":"xvp9nqy9"}
{"time":"2024-12-16T11:34:20.703227695+08:00","level":"INFO","msg":"writer: Do: started","stream_id":"xvp9nqy9"}
{"time":"2024-12-16T11:34:20.70325138+08:00","level":"INFO","msg":"sender: started","stream_id":"xvp9nqy9"}
{"time":"2024-12-16T11:34:30.706322806+08:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/graphql\": dial tcp: lookup api.wandb.ai on 114.114.114.114:53: read udp 10.16.4.6:35753->114.114.114.114:53: i/o timeout"}
{"time":"2024-12-16T11:34:42.710110122+08:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/graphql\": dial tcp: lookup api.wandb.ai on 114.114.114.114:53: read udp 10.16.4.6:54475->114.114.114.114:53: i/o timeout"}
{"time":"2024-12-16T11:34:56.93263181+08:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/graphql\": dial tcp: lookup api.wandb.ai on 114.114.114.114:53: read udp 10.16.4.6:34100->114.114.114.114:53: i/o timeout"}
{"time":"2024-12-16T11:35:16.263212461+08:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/graphql\": dial tcp: lookup api.wandb.ai on 114.114.114.114:53: read udp 10.16.4.6:55712->114.114.114.114:53: i/o timeout"}
{"time":"2024-12-16T11:35:44.386318183+08:00","level":"INFO","msg":"api: retrying error","error":"Post \"https://api.wandb.ai/graphql\": dial tcp: lookup api.wandb.ai on 114.114.114.114:53: read udp 10.16.4.6:52173->114.114.114.114:53: i/o timeout"}
{"time":"2024-12-16T11:36:20.624377562+08:00","level":"INFO","msg":"stream: closing","id":"xvp9nqy9"}
{"time":"2024-12-16T11:36:20.624420999+08:00","level":"INFO","msg":"handler: closed","stream_id":"xvp9nqy9"}
{"time":"2024-12-16T11:36:20.624427472+08:00","level":"ERROR","msg":"sender: upsertRun:","error":"failed to upsert bucket: api: failed sending: context canceled"}
{"time":"2024-12-16T11:36:20.624444635+08:00","level":"INFO","msg":"writer: Close: closed","stream_id":"xvp9nqy9"}
{"time":"2024-12-16T11:36:20.624466627+08:00","level":"WARN","msg":"runwork: ignoring record after close","work":{"Record":{"RecordType":{"Request":{"RequestType":{"Defer":{}}}},"control":{"always_send":true}}}}
{"time":"2024-12-16T11:36:20.624596137+08:00","level":"INFO","msg":"sender: closed","stream_id":"xvp9nqy9"}
{"time":"2024-12-16T11:36:20.624816752+08:00","level":"INFO","msg":"stream: closed","id":"xvp9nqy9"}
2024-12-16 11:34:20,587 INFO MainThread:9913 [wandb_setup.py:_flush():68] Current SDK version is 0.19.1
2024-12-16 11:34:20,587 INFO MainThread:9913 [wandb_setup.py:_flush():68] Configure stats pid to 9913
2024-12-16 11:34:20,587 INFO MainThread:9913 [wandb_setup.py:_flush():68] Loading settings from /root/.config/wandb/settings
2024-12-16 11:34:20,587 INFO MainThread:9913 [wandb_setup.py:_flush():68] Loading settings from /mnt/xgen-mm/LAVIS/wandb/settings
2024-12-16 11:34:20,587 INFO MainThread:9913 [wandb_setup.py:_flush():68] Loading settings from environment variables
2024-12-16 11:34:20,587 INFO MainThread:9913 [wandb_init.py:_log_setup():528] Logging user logs to /mnt/xgen-mm/LAVIS/wandb/run-20241216_113420-xvp9nqy9/logs/debug.log
2024-12-16 11:34:20,588 INFO MainThread:9913 [wandb_init.py:_log_setup():529] Logging internal logs to /mnt/xgen-mm/LAVIS/wandb/run-20241216_113420-xvp9nqy9/logs/debug-internal.log
2024-12-16 11:34:20,588 INFO MainThread:9913 [wandb_init.py:init():644] calling init triggers
2024-12-16 11:34:20,588 INFO MainThread:9913 [wandb_init.py:init():650] wandb.init called with sweep_config: {}
config: {'model_family': 'xgenmm_v1', 'vision_encoder_path': 'google/siglip-so400m-patch14-384', 'vision_encoder_pretrained': 'google', 'lm_path': 'microsoft/Phi-3-mini-4k-instruct', 'tokenizer_path': 'microsoft/Phi-3-mini-4k-instruct', 'cross_attn_every_n_layers': 1, 'num_vision_tokens': 128, 'pretrained': '/mnt/xgen-mm/xgen-mm-phi3-mini-base-r-v1.5.pt', 'pretrained_vision_tokenizer': None, 'loss': 'supervised_finetune', 'run_name': 'finetune-xgenmmv1-phi3_4k_instruct-example_data_config', 'resume_from_checkpoint': None, 'delete_previous_checkpoint': False, 'no_save_optim_state': True, 'gradient_accumulation_steps': 1, 'seed': 42, 'learning_rate': 2e-05, 'lr_scheduler': 'cosine', 'warmup_steps': 2000, 'weight_decay': 0.0, 'precision': 'amp_bf16', 'gradient_checkpointing': True, 'num_epochs': 1, 'offline': False, 'logging_steps': 100, 'checkpoint_steps': 5000, 'data_path': '/mnt/xgen-mm/LAVIS/data_configs/example_data_config.yaml', 'batch_size': 8, 'workers': 4, 'data_sampler_group_by_length': True, 'is_multimodal': True, 'mm_use_im_start_end': False, 'conv_template_name': 'phi_3', 'image_aspect_ratio': 'anyres', 'anyres_patch_sampling': True, 'anyres_grids': [(1, 2), (2, 1), (2, 2), (3, 1), (1, 3)], 'dist_url': 'env://', 'dist_backend': 'nccl', 'horovod': False, 'no_set_device_rank': False, 'local_rank': 0, 'fsdp': True, 'fsdp_sharding_strategy': 'hybrid', 'report_to_wandb': True, 'wandb_project': 'blip3-xgenmm-finetune', 'wandb_entity': None, 'save_checkpoints_to_wandb': False, 'dryrun': False, 'use_flash_attention_2': False, 'unfreeze_vision_encoder': False, 'vision_encoder_precision': 'fp32', 'cpu_offload_gradients': False, 'rank': 0, 'world_size': 8, 'distributed': True, 'device': 'cuda:0'}
2024-12-16 11:34:20,588 INFO MainThread:9913 [wandb_init.py:init():680] starting backend
2024-12-16 11:34:20,588 INFO MainThread:9913 [wandb_init.py:init():684] sending inform_init request
2024-12-16 11:34:20,592 INFO MainThread:9913 [backend.py:_multiprocessing_setup():104] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-12-16 11:34:20,592 INFO MainThread:9913 [wandb_init.py:init():697] backend started and connected
2024-12-16 11:34:20,594 INFO MainThread:9913 [wandb_init.py:init():790] updated telemetry
2024-12-16 11:34:20,601 INFO MainThread:9913 [wandb_init.py:init():822] communicating run to backend with 120.0 second timeout
2024-12-16 11:34:48,790 INFO Thread-2 (wrapped_target):9913 [retry.py:__call__():172] Retry attempt failed:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 199, in _new_conn
sock = connection.create_connection(
File "/usr/local/lib/python3.10/site-packages/urllib3/util/connection.py", line 60, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/local/lib/python3.10/socket.py", line 955, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 789, in urlopen
response = self._make_request(
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 490, in _make_request
raise new_e
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 466, in _make_request
self._validate_conn(conn)
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
conn.connect()
File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 693, in connect
self.sock = sock = self._new_conn()
File "/usr/local/lib/python3.10/site-packages/urllib3/connection.py", line 206, in _new_conn
raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f72d2485750>: Failed to resolve 'api.wandb.ai' ([Errno -3] Temporary failure in name resolution)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 667, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 843, in urlopen
retries = retries.increment(
File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 519, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f72d2485750>: Failed to resolve 'api.wandb.ai' ([Errno -3] Temporary failure in name resolution)"))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/wandb/sdk/lib/retry.py", line 131, in __call__
result = self._call_fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/wandb/sdk/internal/internal_api.py", line 393, in execute
return self.client.execute(*args, **kwargs) # type: ignore
File "/usr/local/lib/python3.10/site-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py", line 52, in execute
result = self._get_result(document, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/wandb/vendor/gql-0.2.0/wandb_gql/client.py", line 60, in _get_result
return self.transport.execute(document, *args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/wandb/sdk/lib/gql_request.py", line 58, in execute
request = self.session.post(self.url, **post_args)
File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 637, in post
return self.request("POST", url, data=data, json=json, **kwargs)
File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.10/site-packages/requests/adapters.py", line 700, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f72d2485750>: Failed to resolve 'api.wandb.ai' ([Errno -3] Temporary failure in name resolution)"))
2024-12-16 11:36:20,618 ERROR MainThread:9913 [wandb_init.py:init():849] encountered error: Run initialization has timed out after 120.0 sec. Please try increasing the timeout with the `init_timeout` setting: `wandb.init(settings=wandb.Settings(init_timeout=120))`.
2024-12-16 11:36:20,618 ERROR MainThread:9913 [wandb_init.py:init():1308] error in wandb.init()
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1298, in init
return wi.init()
File "/usr/local/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 855, in init
raise error
wandb.errors.errors.CommError: Run initialization has timed out after 120.0 sec. Please try increasing the timeout with the `init_timeout` setting: `wandb.init(settings=wandb.Settings(init_timeout=120))`.
2024-12-16 11:36:20,624 WARNING MsgRouterThr:9913 [router.py:message_loop():75] message_loop has been closed
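
Note on the three runs above: each one fails at DNS resolution of api.wandb.ai (the lookup against 114.114.114.114 times out), so raising init_timeout from 90 to 120 seconds does not change the outcome. A minimal sketch of one possible workaround follows, assuming the training script is free to pick the wandb mode before it calls wandb.init. The helper names can_resolve and choose_wandb_mode are hypothetical and are not part of wandb or LAVIS; mode="offline" is an existing wandb.init argument.

```python
import socket


def can_resolve(host, port=443, resolver=socket.getaddrinfo):
    """Return True if `host` resolves, False on a name-resolution error.

    `resolver` is injectable so the check can be exercised without a network.
    """
    try:
        resolver(host, port)
        return True
    except socket.gaierror:
        # Same failure class as "[Errno -3] Temporary failure in name resolution".
        return False


def choose_wandb_mode(host="api.wandb.ai", resolver=socket.getaddrinfo):
    """Pick "online" when the wandb API host resolves, otherwise "offline"."""
    return "online" if can_resolve(host, resolver=resolver) else "offline"


# Hypothetical use inside a training script (wandb itself is not imported here):
#   mode = choose_wandb_mode()
#   wandb.init(project="blip3-xgenmm-finetune", mode=mode)
```

Runs recorded in offline mode stay in the local wandb directory and can be uploaded later with `wandb sync` from a machine that can reach the API.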